SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Accelerating the ML lifecycle
with an enterprise-grade Feature Store
Mike Del Balso, Tecton
Geoff Sims, Atlassian
Spark+AI Summit
Wednesday, 6/24 3:40pm PDT
1
2
Product Manager / Ads ML
Product Manager / Created Michelangelo Platform
Co-founder and CEO / Data Platform for ML
2
About Us
Mike Del Balso
Principal Data Scientist
Geoff Sims
Machine learning today often falls short of its potential
Limited predictive data Long development cycles Painful path to Production
3
Fraud detection
CTR Prediction
Pricing
Customer support
Recommendation
Search
Segmentation
Chat bots
Common Use CasesOperational ML
powers automated
business decisions and
customer experiences
at scale
Characteristics
4
Customer-facing impact
Time-sensitive
Production SLAs
Subject to Regulation
Cross-functional stakeholders
Data Collection
Data Verification
Feature Engineering
Testing and
Debugging
Resource
Management
Metadata Management
Process Management
Serving
Infrastructure
Monitoring
Data Collection
Data Verification
Feature Engineering
Testing and
Debugging
Resource
Management
Metadata Management
Process Management
Serving
Infrastructure
Monitoring
Building Operational ML applications is very complex.
Data is at the core of that complexity.
5
Configuration
Automation
Data-Related
ML
Code
Model Analysis
Sculley, David, et al. "Machine learning: The high interest credit card of technical debt." (2014).
Features are the signals we extract from data and
are a critical part of any ML application.
6
“Applied machine learning is basically feature engineering.”
– Andrew Ng
user_id user_clicks_last7d user_in_target_cou... ...
1001 2 1 ...
1002 13 1 ...
1003 0 0 ...
... ... ... ...
Tooling for managing features is almost non-existent
7
Models
MLOps PipelineModel Training
Model Serving
DevelopmentApps Production
DevOps Pipeline
RunBuild
Development
Features /
Data
Feature 1Feature 1
...
? ...
22
Feature Engineering Feature Serving
Tecton is a data platform for ML applications
Faster development cycles
Lower time-to-production
Lower operational cost
Easier adoption of ML across teams
8
Critical problems solved by data platforms for ML:
9
Managing sprawling
feature transform logic
Building high-quality
training sets from messy data
Deploying to production
& moving beyond batch to
real-time
10
Managing sprawling and disconnected feature transform logic
Challenge 1
Features ModelRaw Data
Features are some of the most highly curated and refined data in a business,
yet they are also some of the most poorly managed assets
11
Managing sprawling and disconnected feature transform logic
Challenge 1
Tecton centrally manages features as software assets
Solution
12
Feature Store
Raw Data Model
Solution
Easily contribute new features
to the feature store
13
1. Define a feature 2. Save it to the feature store
feature_repo/ads_behavior/user_clicks.py $ tecton apply
Tecton CLI
14
Common challenges assembling training data
Stitching multiple data pipelines / data sources
together
Feature backfills
Time travel + point-in-time correctness
Data leakage
Delivering training data to training jobs
15
Building
high-quality
training sets from
messy data
Challenge 2
2. Get training data for the events of interest
1. Configure what features you want in a training dataset
16
Solution
Configuration-based
training data set
generation through
simple APIs
user_id ad_id timestamp
111 444 2020-02-01...
222 555 2020-02-02...
333 666 2020-02-01...
... ... ...
distinct_users user_clicks_last7d user_in_target_cou... ...
24 2 1 ...
21 13 1 ...
20 0 0 ...
... ... ... ...
Solution
Built-in row-level time travel for accurate training data
17
1. Modeler provides event time
and entity IDs
2. Tecton returns point-in-time
correct feature values
Common challenges when deploying to
production
Throwing it over the wall for reimplementation in
production environment
Infrastructure provisioning
Freshness vs. cost
Train-serve skew
Drift & data quality monitoring
18
Deploying to
production & moving
from batch to real-time
Challenge 3
Training
Features
19
Today, moving to production requires reimplementation
Challenge 3
TrainingTraining
ImplementationRaw Data
Training
Features
Serving
Features
20
Today, moving to production requires reimplementation
Challenge 3
Serving
Implementation
Training
Prediction?
Training
Implementation
Production
Data
Raw Data
Serving
FeaturesProduction
Data
Raw Data
21
Duplicate transform implementations are error-prone
Challenge 3
Training
Features Training
Prediction
Training
Implementation
Serving
Implementation
Training
Features Training
Prediction
Training
Implementation
Serving
Features
22
Differences in implementations or data across
training and serving can break models
Challenge 3
Duplicate transform implementations lead to train-serve skew
A
B
A B
Production
Data
Raw Data
Serving
Implementation
Unified train/predict pipelines ensure online/offline consistency
23
Solution
Training
Features Training
PredictionServing
FeaturesProduction
Data
Raw Data
Training + Serving
Implementation
Tecton delivers those features “online” for real-time predictions
24
Solution
● Single row
● Delivered in milliseconds
● REST API
Training
Features Training
PredictionServing
Features
● Millions of rows
● Dataframe API
Virtual Data
Source
Training + Serving
Implementation
End-to-End Feature Lifecycle Management
25
26
Principal Data Scientist
Geoff Sims
How Atlassian Uses
Tecton to Get ML
Into Production
Example: Automated Content Categorization in Jira
27
Solution
Automated labeling is needed for
every Jira issue
28
29
30
Thank you
31

Weitere ähnliche Inhalte

Was ist angesagt?

H2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicH2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicSri Ambati
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDatabricks
 
H2O World - Self Guiding Applications with Venkatesh Yadav
H2O World - Self Guiding Applications with Venkatesh YadavH2O World - Self Guiding Applications with Venkatesh Yadav
H2O World - Self Guiding Applications with Venkatesh YadavSri Ambati
 
Feature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scaleFeature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scaleNoriaki Tatsumi
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Provectus
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Productioniguazio
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyDatabricks
 
Getting Started with Azure AutoML
Getting Started with Azure AutoMLGetting Started with Azure AutoML
Getting Started with Azure AutoMLVivek Raja P S
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksDatabricks
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
Model Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksModel Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksDatabricks
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowLviv Startup Club
 
Ml ops past_present_future
Ml ops past_present_futureMl ops past_present_future
Ml ops past_present_futureNisha Talagala
 

Was ist angesagt? (20)

H2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicH2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom Kraljevic
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging Technologies
 
H2O World - Self Guiding Applications with Venkatesh Yadav
H2O World - Self Guiding Applications with Venkatesh YadavH2O World - Self Guiding Applications with Venkatesh Yadav
H2O World - Self Guiding Applications with Venkatesh Yadav
 
Feature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scaleFeature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scale
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Global ai conf_final
Global ai conf_finalGlobal ai conf_final
Global ai conf_final
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and Cheaply
 
Getting Started with Azure AutoML
Getting Started with Azure AutoMLGetting Started with Azure AutoML
Getting Started with Azure AutoML
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Model Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksModel Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on Databricks
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Msst 2019 v4
Msst 2019 v4Msst 2019 v4
Msst 2019 v4
 
Ml ops past_present_future
Ml ops past_present_futureMl ops past_present_future
Ml ops past_present_future
 

Ähnlich wie Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store

Why Data Virtualization? An Introduction
Why Data Virtualization? An IntroductionWhy Data Virtualization? An Introduction
Why Data Virtualization? An IntroductionDenodo
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringCognizant
 
Informatica_3_yrs_Exp_revised
Informatica_3_yrs_Exp_revisedInformatica_3_yrs_Exp_revised
Informatica_3_yrs_Exp_revisedkumari anuradha
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service ManagementIBM Danmark
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Brighttalk converged infrastructure and it operations management - final
Brighttalk   converged infrastructure and it operations management - finalBrighttalk   converged infrastructure and it operations management - final
Brighttalk converged infrastructure and it operations management - finalAndrew White
 
Emvigo Data Visualization - E Commerce Deck
Emvigo Data Visualization - E Commerce DeckEmvigo Data Visualization - E Commerce Deck
Emvigo Data Visualization - E Commerce DeckEmvigo Technologies
 
L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale Microsoft Ideas
 
Bimodal IT and EDW Modernization
Bimodal IT and EDW ModernizationBimodal IT and EDW Modernization
Bimodal IT and EDW ModernizationRobert Gleave
 
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Denodo
 
Engineering_Campus_Presentation_2022 (1)-compressed.pptx
Engineering_Campus_Presentation_2022 (1)-compressed.pptxEngineering_Campus_Presentation_2022 (1)-compressed.pptx
Engineering_Campus_Presentation_2022 (1)-compressed.pptxManikaahuja4
 
Production machine learning: Managing models, workflows and risk at scale
Production machine learning: Managing models, workflows and risk at scaleProduction machine learning: Managing models, workflows and risk at scale
Production machine learning: Managing models, workflows and risk at scaleAlex Housley
 
How Large Enterprises Use Platform Governance to Gain Agility
How Large Enterprises Use Platform Governance to Gain AgilityHow Large Enterprises Use Platform Governance to Gain Agility
How Large Enterprises Use Platform Governance to Gain AgilityOdaseva
 
Prateek sharma etl_datastage_exp3.9yrs_resume
Prateek sharma etl_datastage_exp3.9yrs_resumePrateek sharma etl_datastage_exp3.9yrs_resume
Prateek sharma etl_datastage_exp3.9yrs_resumePrateek Sharma
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...InfluxData
 
Rethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldRethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldHao Tran
 

Ähnlich wie Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store (20)

Why Data Virtualization? An Introduction
Why Data Virtualization? An IntroductionWhy Data Virtualization? An Introduction
Why Data Virtualization? An Introduction
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature Engineering
 
Informatica_3_yrs_Exp_revised
Informatica_3_yrs_Exp_revisedInformatica_3_yrs_Exp_revised
Informatica_3_yrs_Exp_revised
 
Hybrid ICC
Hybrid ICCHybrid ICC
Hybrid ICC
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service Management
 
BAKKIYA_4YR
BAKKIYA_4YRBAKKIYA_4YR
BAKKIYA_4YR
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Brighttalk converged infrastructure and it operations management - final
Brighttalk   converged infrastructure and it operations management - finalBrighttalk   converged infrastructure and it operations management - final
Brighttalk converged infrastructure and it operations management - final
 
IBM Think Milano
IBM Think MilanoIBM Think Milano
IBM Think Milano
 
Emvigo Data Visualization - E Commerce Deck
Emvigo Data Visualization - E Commerce DeckEmvigo Data Visualization - E Commerce Deck
Emvigo Data Visualization - E Commerce Deck
 
L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale
 
Bimodal IT and EDW Modernization
Bimodal IT and EDW ModernizationBimodal IT and EDW Modernization
Bimodal IT and EDW Modernization
 
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
Bridging the Last Mile: Getting Data to the People Who Need It (APAC)
 
Engineering_Campus_Presentation_2022 (1)-compressed.pptx
Engineering_Campus_Presentation_2022 (1)-compressed.pptxEngineering_Campus_Presentation_2022 (1)-compressed.pptx
Engineering_Campus_Presentation_2022 (1)-compressed.pptx
 
Production machine learning: Managing models, workflows and risk at scale
Production machine learning: Managing models, workflows and risk at scaleProduction machine learning: Managing models, workflows and risk at scale
Production machine learning: Managing models, workflows and risk at scale
 
How Large Enterprises Use Platform Governance to Gain Agility
How Large Enterprises Use Platform Governance to Gain AgilityHow Large Enterprises Use Platform Governance to Gain Agility
How Large Enterprises Use Platform Governance to Gain Agility
 
Prateek sharma etl_datastage_exp3.9yrs_resume
Prateek sharma etl_datastage_exp3.9yrs_resumePrateek sharma etl_datastage_exp3.9yrs_resume
Prateek sharma etl_datastage_exp3.9yrs_resume
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
 
Rethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldRethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile World
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 

Kürzlich hochgeladen (20)

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 

Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store

  • 1. Accelerating the ML lifecycle with an enterprise-grade Feature Store Mike Del Balso, Tecton Geoff Sims, Atlassian Spark+AI Summit Wednesday, 6/24 3:40pm PDT 1
  • 2. 2 Product Manager / Ads ML Product Manager / Created Michelangelo Platform Co-founder and CEO / Data Platform for ML 2 About Us Mike Del Balso Principal Data Scientist Geoff Sims
  • 3. Machine learning today often falls short of its potential Limited predictive data Long development cycles Painful path to Production 3
  • 4. Fraud detection CTR Prediction Pricing Customer support Recommendation Search Segmentation Chat bots Common Use CasesOperational ML powers automated business decisions and customer experiences at scale Characteristics 4 Customer-facing impact Time-sensitive Production SLAs Subject to Regulation Cross-functional stakeholders
  • 5. Data Collection Data Verification Feature Engineering Testing and Debugging Resource Management Metadata Management Process Management Serving Infrastructure Monitoring Data Collection Data Verification Feature Engineering Testing and Debugging Resource Management Metadata Management Process Management Serving Infrastructure Monitoring Building Operational ML applications is very complex. Data is at the core of that complexity. 5 Configuration Automation Data-Related ML Code Model Analysis Sculley, David, et al. "Machine learning: The high interest credit card of technical debt." (2014).
  • 6. Features are the signals we extract from data and are a critical part of any ML application. 6 “Applied machine learning is basically feature engineering.” – Andrew Ng user_id user_clicks_last7d user_in_target_cou... ... 1001 2 1 ... 1002 13 1 ... 1003 0 0 ... ... ... ... ...
  • 7. Tooling for managing features is almost non-existent 7 Models MLOps PipelineModel Training Model Serving DevelopmentApps Production DevOps Pipeline RunBuild Development Features / Data Feature 1Feature 1 ... ? ... 22 Feature Engineering Feature Serving
  • 8. Tecton is a data platform for ML applications Faster development cycles Lower time-to-production Lower operational cost Easier adoption of ML across teams 8
  • 9. Critical problems solved by data platforms for ML: 9 Managing sprawling feature transform logic Building high-quality training sets from messy data Deploying to production & moving beyond batch to real-time
  • 10. 10 Managing sprawling and disconnected feature transform logic Challenge 1 Features ModelRaw Data Features are some of the most highly curated and refined data in a business, yet they are also some of the most poorly managed assets
  • 11. 11 Managing sprawling and disconnected feature transform logic Challenge 1
  • 12. Tecton centrally manages features as software assets Solution 12 Feature Store Raw Data Model
  • 13. Solution Easily contribute new features to the feature store 13 1. Define a feature 2. Save it to the feature store feature_repo/ads_behavior/user_clicks.py $ tecton apply Tecton CLI
  • 14. 14
  • 15. Common challenges assembling training data Stitching multiple data pipelines / data sources together Feature backfills Time travel + point-in-time correctness Data leakage Delivering training data to training jobs 15 Building high-quality training sets from messy data Challenge 2
  • 16. 2. Get training data for the events of interest 1. Configure what features you want in a training dataset 16 Solution Configuration-based training data set generation through simple APIs
  • 17. user_id ad_id timestamp 111 444 2020-02-01... 222 555 2020-02-02... 333 666 2020-02-01... ... ... ... distinct_users user_clicks_last7d user_in_target_cou... ... 24 2 1 ... 21 13 1 ... 20 0 0 ... ... ... ... ... Solution Built-in row-level time travel for accurate training data 17 1. Modeler provides event time and entity IDs 2. Tecton returns point-in-time correct feature values
  • 18. Common challenges when deploying to production Throwing it over the wall for reimplementation in production environment Infrastructure provisioning Freshness vs. cost Train-serve skew Drift & data quality monitoring 18 Deploying to production & moving from batch to real-time Challenge 3
  • 19. Training Features 19 Today, moving to production requires reimplementation Challenge 3 TrainingTraining ImplementationRaw Data
  • 20. Training Features Serving Features 20 Today, moving to production requires reimplementation Challenge 3 Serving Implementation Training Prediction? Training Implementation Production Data Raw Data
  • 21. Serving FeaturesProduction Data Raw Data 21 Duplicate transform implementations are error-prone Challenge 3 Training Features Training Prediction Training Implementation Serving Implementation
  • 22. Training Features Training Prediction Training Implementation Serving Features 22 Differences in implementations or data across training and serving can break models Challenge 3 Duplicate transform implementations lead to train-serve skew A B A B Production Data Raw Data Serving Implementation
  • 23. Unified train/predict pipelines ensure online/offline consistency 23 Solution Training Features Training PredictionServing FeaturesProduction Data Raw Data Training + Serving Implementation
  • 24. Tecton delivers those features “online” for real-time predictions 24 Solution ● Single row ● Delivered in milliseconds ● REST API Training Features Training PredictionServing Features ● Millions of rows ● Dataframe API Virtual Data Source Training + Serving Implementation
  • 26. 26 Principal Data Scientist Geoff Sims How Atlassian Uses Tecton to Get ML Into Production
  • 27. Example: Automated Content Categorization in Jira 27 Solution Automated labeling is needed for every Jira issue
  • 28. 28
  • 29. 29
  • 30. 30