A Practical Feature
Store on Delta Lake
Nathan Buesgens
ML Operations
Bryan Christian
Data Science
Agenda
§ What is a Feature Store?
▪ MLOps for Acceleration and
Governance in the Enterprise
▪ Feature Store: Use Cases
▪ Edge Cases: 80/20
▪ Relation to the Data Warehouse
§ Design Reference
▪ Logical Data Model & Access
Patterns
▪ Physical Representation in the Delta
Lake
What is a Feature Store?
75%
Reduction in Feature Engineering
“Data Wrangling” Time
15X
Accelerated Model Delivery
with MLOps Automation and
Governance
END-TO-END VALUE DELIVERY
TIME TO VALUE & CONCURRENCY
SCALABLE INFRASTRUCTURE
I.E. AVOID:
“PROOF OF CONCEPT FACTORY”
MLOps: Data Science at Scale
BOTTLENECK
Feature
Engineering
Modelling
The feature store serves as the
consumption layer for ML
applications. It provides:
• Acceleration: pre-“hardened”
features reduce data wrangling
time for the Data Scientist.
• Governance: a common
consumption pattern ensures
nothing is lost in the translation
to production.
[Diagram: Curated Data → Feature Engineering → Modelling → Predictions, with Feature Engineering repeated per model, contrasted with a single Feature Store feeding several Modelling steps.]
Example: Feature Store
Infrastructure to support DS + MLE
The Feature Store is built on the following data science requirements that are relevant to predictive
analytics in Financial Services use cases.
Correct and consistently applied
joins across multiple Delta
files without loss of processing
speed
Aggregations, window functions,
and transformations of data
Granularity of point in time and
level of the prediction (e.g.
individual, account, etc.)
customer_id  as_of       feature_name_last_0-30_days_prior  feature_name_last_31-60_days_prior  feature_name_next_1-30_days
12345        2021-05-01  0.43                               0.32                                0.21
23456        2021-05-01  0.99                               0.94                                0.98
34567        2021-05-01  0.03                               0.92                                0.13
45678        2021-05-01  0.42                               0.59                                0.50
The Feature Store uses an “as_of” date as the point-in-time granularity for both backward- and forward-facing
windows. Code-embedded metadata allows easy removal of forward-facing windows from the
“independent” variables to prevent feature leakage.
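The leakage guard described above can be sketched in a few lines. This is a minimal illustration only: the "window" metadata tag and the helper name split_independent_and_target are assumptions made for the example, not part of the SDK shown in the slides.

```python
# Hypothetical feature metadata: the "window" tag is an assumed convention,
# not a documented field of the SDK described in the slides.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior": {"window": "backward"},
    "feature_name_last_31-60_days_prior": {"window": "backward"},
    "feature_name_next_1-30_days": {"window": "forward"},
}

KEY_COLUMNS = ("customer_id", "as_of")


def split_independent_and_target(row, metadata=FEATURE_METADATA):
    """Split one feature-store row into model inputs and forward-facing targets."""
    independent, targets = {}, {}
    for column, value in row.items():
        if column in KEY_COLUMNS:
            continue
        if metadata[column]["window"] == "forward":
            targets[column] = value  # looks past as_of: never a model input
        else:
            independent[column] = value
    return independent, targets


row = {
    "customer_id": 12345,
    "as_of": "2021-05-01",
    "feature_name_last_0-30_days_prior": 0.43,
    "feature_name_last_31-60_days_prior": 0.32,
    "feature_name_next_1-30_days": 0.21,
}
X, y = split_independent_and_target(row)
print(X)
print(y)
```

Because the split is driven by metadata rather than by column-name conventions, a renamed column cannot silently move into the model inputs.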
Data Science Use Cases
§ Many ML use cases don’t have an
online requirement, esp.
“Human + AI”.
§ Extending the MVP:
▪ Some online use cases can be
reframed as streaming use cases.
▪ Online use cases can be met with
extension to the Delta Lake design.
▪ See: feast.dev
§ Low-code & citizen science expands
user base, doesn’t necessarily
accelerate existing users.
§ 80/20 value from:
Optimizing Access vs. Optimizing
ETL Development
“Online” Features
Ultra-Low-Latency, Ultra-Timely Point Reads
Low-Code ETL
Configuration Based, AutoML, FeatureFlow, etc.
Edge Cases
Opportunities to Simplify for an 80/20 Feature Store MVP
▪ “Golden” aggregates of curated data.
▪ Highly structured, well-defined
granularities (esp. as 80/20 solution).
▪ Similar non-functional requirements for
strong governance standards, metadata
management, discovery, etc.
▪ Different Use Case: BI vs. Modelling
▪ Different Access Patterns, therefore:
▪ Different Data Model
▪ Different Technology Stack
▪ Supervised learning creates complex
requirements for:
“point in time accurate data”
• Differences
• Similarities
Comparison with Data Warehouse
i.e. Dimensional Model
Design
WINDOW FUNCTIONS
WATERMARK
1
2
3
FEATURE LEAKAGE
Point in Time Accurate Data
Three Ways Inconsistency Sneaks In
Structured Streaming Programming Guide
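As a rough illustration of how window functions relate to the “as of” date, the sketch below computes backward- and forward-facing aggregates for one entity. The event data, column names, and the window_sum helper are assumptions made for the example; a production pipeline would more likely use Spark window functions.

```python
from datetime import date, timedelta


def window_sum(events, as_of, start_offset_days, end_offset_days):
    """Sum event amounts whose date falls in [as_of + start, as_of + end] (inclusive)."""
    start = as_of + timedelta(days=start_offset_days)
    end = as_of + timedelta(days=end_offset_days)
    return sum(amount for day, amount in events if start <= day <= end)


# Hypothetical purchase events for one customer: (event date, amount).
events = [
    (date(2021, 3, 20), 5.0),   # 42 days before as_of
    (date(2021, 4, 15), 10.0),  # 16 days before as_of
    (date(2021, 5, 1), 2.0),    # on the as_of date itself
    (date(2021, 5, 10), 7.0),   # 9 days after as_of
]
as_of = date(2021, 5, 1)

row = {
    "as_of": as_of.isoformat(),
    # Backward-facing windows only see events on or before as_of.
    "amount_last_0-30_days_prior": window_sum(events, as_of, -30, 0),
    "amount_last_31-60_days_prior": window_sum(events, as_of, -60, -31),
    # The forward-facing window starts the day after as_of, so it can serve as a target.
    "amount_next_1-30_days": window_sum(events, as_of, 1, 30),
}
print(row)
```

Keeping the window boundaries explicit and relative to as_of is what prevents an event from after the prediction date from leaking into a backward-facing feature.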
§ The thing being modelled.
The “Entity”
Term borrowed from Feast
Granularity
“As of”
Every feature for an entity “as of” a date.
Columns
§ Discrete granularity (daily, hourly, etc.), not an
“event time”.
§ 80/20 solution.
§ For “continuous” granularity see: Feast.
Features
Un-vectorized (80/20)
Targets
Necessarily at same granularity as features.
Predictions
One model’s prediction is often another’s feature.
Feature Store Logical Model
Data Model for Feature Store Access
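One way to picture the logical model is as a row keyed by entity and “as of” date. The sketch below is illustrative only: the class names FeatureKey and FeatureRow and the to_as_of helper are assumptions, and daily granularity is chosen as one example of a discrete granularity.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass(frozen=True)
class FeatureKey:
    """Granularity of a row: one entity "as of" one discrete date."""
    entity_id: int
    as_of: date


@dataclass
class FeatureRow:
    key: FeatureKey
    features: dict = field(default_factory=dict)     # un-vectorized scalar columns
    targets: dict = field(default_factory=dict)      # same granularity as features
    predictions: dict = field(default_factory=dict)  # may feed other models as features


def to_as_of(event_time: datetime) -> date:
    """Map a continuous event time onto the discrete daily "as of" granularity."""
    return event_time.date()


key = FeatureKey(entity_id=12345, as_of=to_as_of(datetime(2021, 5, 1, 14, 30)))
row = FeatureRow(key=key, features={"feature_name_last_0-30_days_prior": 0.43})
print(row)
```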
No need to rebuild the whole
feature store when new features
are added.
(Certain sets of features might be rebuilt
at times, though they will have much
shorter downtime.)
The SDK indexes the available features and, upon request, builds the joins that combine all desired features
into one cohesive dataframe, providing a production-grade feature selection tool.
Keyword searching enabled for
features so you can find any
feature you're looking for using
"human" logic
Tuning can be specific to each set
of features allowing more optimal
feature creation.
find() To search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
select() When you know exactly the features you want
select_by() Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex
Core Functionality
SDK for Feature Store
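The idea of indexing features and building joins on request can be sketched with plain dictionaries. Everything here is hypothetical: the table names, the FEATURE_INDEX structure, and build_frame are stand-ins for whatever the SDK does internally against Delta tables.

```python
# Hypothetical physical feature tables, each keyed by (customer_id, as_of).
TABLES = {
    "purchases": {
        (12345, "2021-05-01"): {"feature_name_1": 0.43},
        (23456, "2021-05-01"): {"feature_name_1": 0.99},
    },
    "widgets": {
        (12345, "2021-05-01"): {"feature_name_qwerty_1": 0.32},
        (23456, "2021-05-01"): {"feature_name_qwerty_1": 0.94},
    },
}

# Index built once: which table holds which feature.
FEATURE_INDEX = {
    feature: table_name
    for table_name, rows in TABLES.items()
    for row in rows.values()
    for feature in row
}


def build_frame(*features):
    """Inner-join only the tables that hold the requested features."""
    needed = {FEATURE_INDEX[f] for f in features}
    keys = set.intersection(*(set(TABLES[t]) for t in needed))
    frame = []
    for key in sorted(keys):
        record = {"customer_id": key[0], "as_of": key[1]}
        for f in features:
            record[f] = TABLES[FEATURE_INDEX[f]][key][f]
        frame.append(record)
    return frame


for record in build_frame("feature_name_1", "feature_name_qwerty_1"):
    print(record)
```

Because only the tables that hold requested features are touched, adding a new feature table extends the index without forcing a rebuild of the existing tables.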
find() To search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
Arguments
regexp           A regular expression
kwrds            A list of keywords to look for
keys             A dictionary of str: any pointing to tags in the metadata of features, e.g. {"model_output": True}
kwrds_exclude    A list of words to exclude from search
partial          If kwrds is used, this decides if it should find all or any of them when searching.
partial_exclude  If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
verbose          If True, prints out results; otherwise just returns them.
case_sensitive   If True, an exact match is required to return results.
Example
Calling the feature store with “fs”, a command could be:
    fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+")
With a returned result of…
    Your search returned 20 results…
    feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'}
    feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'}
    ...
The find method searches through all features given a set of criteria and returns any matches within the name or metadata
of columns. It is a great tool to explore the data without pulling in massive datasets.
Value to Data Scientist
Explore what features are in
the feature store via metadata
and leverage metadata to
enforce governance (e.g., no
PI, 3rd party data, etc. as
needed)
SDK for Feature Store
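To make the search behaviour concrete, here is a small stand-in for a metadata search. It is not the SDK's implementation: the CATALOGUE contents, the choice to search the feature name plus its comment, and the reading of partial (True accepts any keyword, False requires all) and case_sensitive (toggles case-insensitive matching) are all assumptions made for this sketch.

```python
import re

# Hypothetical catalogue; names and comments echo the slide's example output.
CATALOGUE = {
    "feature_name_1": {
        "comment": "Flag if asdf > 0.3 at any point within the last 3 months.",
        "model_output": False,
    },
    "feature_name_qwerty_1": {
        "comment": "Average number of widgets customer purchased in the last 0-1 months.",
        "model_output": False,
    },
    "churn_score": {"comment": "Output of the churn model.", "model_output": True},
}


def find(regexp=None, kwrds=None, keys=None, partial=False, case_sensitive=False):
    """Return {feature: metadata} for features matching every supplied criterion."""
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = {}
    for name, meta in CATALOGUE.items():
        text = f"{name} {meta.get('comment', '')}"
        if regexp and not re.search(regexp, text, flags):
            continue
        if kwrds:
            hits = [re.search(re.escape(k), text, flags) is not None for k in kwrds]
            # Assumed semantics: partial=True accepts any keyword, otherwise all.
            if not (any(hits) if partial else all(hits)):
                continue
        if keys and any(meta.get(tag) != value for tag, value in keys.items()):
            continue
        matches[name] = meta
    return matches


print(list(find(kwrds=["widgets", "asdf"], partial=True)))
print(list(find(keys={"model_output": True})))
```

Searching metadata only, rather than the feature values, is what lets exploration stay cheap: no feature table has to be read until a selection is made.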
select() When you know exactly the features you want
Arguments
date       Return features for a specific date, or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, e.g. {">": "2021-05-01"}
*features  Feature names as strings
The select method will return a dataframe of all selected features with the given date.
Example
Calling the feature store with “fs”, a command could be:
    dataframe_name = fs.select(
        "latest",  # Give a date {"=": "2021-05-01"} or "latest" for the newest available features
        "feature_name_last_0-30_days_prior",  # List the features you want
        "feature_name_last_31-60_days_prior",
        "feature_name_next_1-30_days",
    )
    display(dataframe_name)
With a returned result of…
    customer_id  as_of       feature_name_last_0-30_days_prior  feature_name_last_31-60_days_prior  feature_name_next_1-30_days
    12345        2021-05-01  0.43                               0.32                                0.21
    23456        2021-05-01  0.99                               0.94                                0.98
Value to Data Scientist
Consistent way of selecting
the same feature set from the
feature store when creating a
dataframe, both in dev and when
deployed in production
SDK for Feature Store
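The date argument, either "latest" or a one-entry dictionary of operator and date, could be interpreted along the following lines. This is a sketch under assumptions: the slides only show the "=" and ">" operators, so the other entries in OPERATORS and the filter_by_date helper are illustrative additions.

```python
import operator
from datetime import date

# Map the operator strings used in the date dictionary to comparison functions.
# Only "=" and ">" appear in the slides; the rest are assumed for completeness.
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def filter_by_date(rows, date_spec):
    """Keep rows whose as_of satisfies "latest" or a one-entry {operator: ISO date} dict."""
    if date_spec == "latest":
        newest = max(row["as_of"] for row in rows)  # ISO dates sort lexicographically
        return [row for row in rows if row["as_of"] == newest]
    (symbol, iso_day), = date_spec.items()
    compare = OPERATORS[symbol]
    threshold = date.fromisoformat(iso_day)
    return [row for row in rows if compare(date.fromisoformat(row["as_of"]), threshold)]


rows = [
    {"customer_id": 12345, "as_of": "2021-04-01"},
    {"customer_id": 12345, "as_of": "2021-05-01"},
    {"customer_id": 12345, "as_of": "2021-06-01"},
]
print(filter_by_date(rows, {">": "2021-05-01"}))
print(filter_by_date(rows, "latest"))
```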
select_by() Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex
Arguments
date             Return features for a specific date, or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, e.g. {">": "2021-05-01"}
regexp           A regular expression
kwrds            A list of keywords to look for
keys             A dictionary of str: any pointing to tags in the metadata of features, e.g. {"model_output": True}
kwrds_exclude    A list of words to exclude from search
partial          If kwrds is used, this decides if it should find all or any of them when searching.
partial_exclude  If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
case_sensitive   If True, an exact match is required to return results.
The select_by method searches through all features given a set of criteria and returns a dataframe including all the
features that match the criteria within the name or metadata.
Example
Calling the feature store with “fs”, a command could be:
    dataframe_name = fs.select_by({"=": "2021-05-01"},
                                  regexp="^(?=.*asdf)(?=.*qwerty).+")
    display(dataframe_name)
With a returned result of…
    customer_id  as_of       feature_name_1  feature_name_qwerty_1  …
    12345        2021-05-01  0.43            0.32                   …
    23456        2021-05-01  0.99            0.94                   …
Value to Data Scientist
Consistent way of exploring
the feature store and
leveraging metadata for
selection while simultaneously
creating a dataframe with the
selected features
SDK for Feature Store
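The regular expression used in the examples chains two lookaheads, so it matches text that contains both "asdf" and "qwerty" in either order. The short check below uses Python's standard re module; the candidate strings are made up for illustration.

```python
import re

# The pattern from the slides: two lookaheads, each evaluated from the start of the text.
pattern = re.compile(r"^(?=.*asdf)(?=.*qwerty).+")

candidates = [
    "asdf_then_qwerty",
    "qwerty_then_asdf",
    "only_asdf",
    "only_qwerty",
]
for text in candidates:
    print(text, bool(pattern.search(text)))
```

Dropping one of the lookaheads turns the search into a single-keyword match, which is roughly what the kwrds argument offers without regex syntax.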
[Diagram: The Delta Lake with Bronze, Silver, and Gold layers connected by ETL steps. The Gold layer holds BI Consumption (Dimensional Model) and ML Consumption (Feature Store). Optional consumption-optimized databases are kept as mirrors: a High Concurrency Data Warehouse and a Low Latency Memory Cache.]
Implementation on the Data Lake
[Diagram: The Delta Lake with Bronze and Silver layers connected by ETL steps, feeding ML Consumption (Feature Store). An optional consumption-optimized database, a Low Latency Memory Cache, is kept as a mirror.]
SDK (Data Access Layer)
• Consistent view of “online” and “historic” features.
• Separation of logical and physical models.
• Metadata focused query interface for data science
exploration.
Historic Feature
Queries
Online Point
Reads
Implementation on the Data Lake
§ Simplifies “point in time joins”.
§ Not as flexible or timely.
Pre-defined time aggregations
“As Of” Granularity
“Dynamic Point in Time Joins”
Demonstrated by Feast
More flexible, improved timeliness.
Multiple feature tables
Technically possible to use a single wide table.
§ Simplifies:
▪ Schema Migration
▪ Query Planning & Optimization
▪ Scheduling
Physical Feature Tables
Two Choices
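For contrast with the pre-aggregated “as of” tables, a dynamic point-in-time join looks up, for each label timestamp, the latest feature value observed at or before that time. The sketch below shows the core lookup with the standard bisect module; the history data and the value_as_of helper are hypothetical, and this is not Feast's implementation.

```python
from bisect import bisect_right

# Hypothetical feature history for one entity, sorted by event timestamp (ISO strings).
history = [
    ("2021-04-28T09:00", 0.40),
    ("2021-05-01T12:00", 0.43),
    ("2021-05-03T08:00", 0.47),
]
timestamps = [ts for ts, _ in history]


def value_as_of(label_time):
    """Return the latest feature value observed at or before label_time, else None."""
    position = bisect_right(timestamps, label_time)
    if position == 0:
        return None  # no observation yet: never look forward in time
    return history[position - 1][1]


print(value_as_of("2021-05-02T00:00"))
print(value_as_of("2021-04-01T00:00"))
```

With a discrete “as of” granularity this lookup collapses into an equality join on (entity, as_of), which is the simplification the first choice buys at the cost of flexibility and timeliness.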
Summary
1
Feature stores accelerate data science & enable
better governance.
2
Most design complexity stems from machine
learning requirements for point in time accurate data.
3
80/20 solutions possible by carefully considering
“online” requirements.

A Practical Enterprise Feature Store on Delta Lake

  • 1. A Practical Feature Store on Delta Lake Nathan Buesgens ML Operations Bryan Christian Data Science
  • 2. Agenda § What is a Feature Store? ▪ MLOps for Acceleration and Governance in the Enterprise ▪ Feature Store: Use Cases ▪ Edge Cases: 80/20 ▪ Relation to the Data Warehouse § Design Reference ▪ Logical Data Model & Access Patterns ▪ Physical Representation in the Delta Lake
  • 3. What is a Feature Store?
  • 4. 75% Reduction in Feature Engineering “Data Wrangling” Time 15X Accelerated Model Delivery with MLOps Automation and Governance END-TO-END VALUE DELIVERY TIME TO VALUE & CONCURRENCY SCALABLE INFRASTRUCTURE I.E. AVOID: “PROOF OF CONCEPT FACTORY” MLOps: Data Science at Scale
  • 5. BOTTLENECK Feature Engineering Modelling The feature store serves as the consumption layer for ML applications. It provides: • Acceleration: pre-"hardened" features reduce data wrangling time for the Data Scientist. • Governance: a common consumption pattern ensures nothing is lost in the translation to production. Predictions Curated Data Feature Engineering Modelling Feature Engineering Modelling Modelling Modelling Modelling Feature Store Example: Feature Store Infrastructure to support DS + MLE
  • 6. The Feature Store is built on the following data science requirements that are relevant to predictive analytics in Financial Services use cases. Correct and consistently applied joins across multiple Delta files without loss of processing speed Aggregations, window functions, and transformations of data Granularity of point in time and level of the prediction (e.g. individual, account, etc.) customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days 12345 2021-05-01 0.43 0.32 0.21 23456 2021-05-01 0.99 0.94 0.98 34567 2021-05-01 0.03 0.92 0.13 45678 2021-05-01 0.42 0.59 0.50 The Feature Store uses the "as_of" date as the point-in-time granularity for both backward- and forward-facing windows. Code-embedded metadata allows easy removal of future-facing windows as "independent" variables to prevent feature leakage. Data Science Use Cases
  • 7. § Many ML use cases don't have an online requirement: Esp. "Human + AI" § Extending the MVP: ▪ Some online use cases can be reframed as streaming use cases. ▪ Online use cases can be met with an extension to the Delta Lake design. ▪ See: feast.dev § Low-code & citizen science expands the user base, doesn't necessarily accelerate existing users. § 80/20 value from: Optimizing Access vs. Optimizing ETL Development "Online" Features Ultra-Low-Latency, Ultra-Timely Point Reads Low-Code ETL Configuration Based, AutoML, FeatureFlow, etc. Edge Cases Opportunities to Simplify for an 80/20 Feature Store MVP
  • 8. ▪ “Golden” aggregates of curated data. ▪ Highly structured, well-defined granularities (esp. as 80/20 solution). ▪ Similar non-functional requirements for strong governance standards, metadata management, discovery, etc. ▪ Different Use Case: BI vs. Modelling ▪ Different Access Patterns, therefore: ▪ Different Data Model ▪ Different Technology Stack ▪ Supervised learning creates complex requirements for: “point in time accurate data” • Differences • Similarities Comparison with Data Warehouse i.e. Dimensional Model
  • 10. WINDOW FUNCTIONS WATERMARK 1 2 3 FEATURE LEAKAGE Point in Time Accurate Data Three Ways Inconsistency Sneaks In
  • 11. Structured Streaming Programming Guide WINDOW FUNCTIONS WATERMARK 1 2 3 FEATURE LEAKAGE Point in Time Accurate Data Three Ways Inconsistency Sneaks In
  • 12. § The thing being modelled. The "Entity" Term borrowed from Feast Granularity "As of" Every feature for an entity "as of" a date. Columns § Discrete granularity (daily, hourly, etc.), not an "event time". § 80/20 solution. § For "continuous" granularity see: Feast. Features Un-vectorized (80/20) Targets Necessarily at same granularity as features. Predictions One model's prediction is often another's feature. Feature Store Logical Model Data Model for Feature Store Access
  • 13. No need to rebuild the whole feature store when new features are added. (Certain sets of features might be rebuilt at times, though with much shorter downtime.) The SDK indexes the available features and, upon request, builds the joins that combine all desired features into one cohesive data frame, providing a production-grade feature selection tool. Keyword searching is enabled for features, so you can find any feature you're looking for using "human" logic. Tuning can be specific to each set of features, allowing more optimal feature creation. find() select() select_by() To search through all columns and metadata for the features you want to use by giving keys, keywords or regex. When you know exactly the features you want Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex Core Functionality SDK for Feature Store
  • 14. find() To search through all columns and metadata for the features you want to use by giving keys, keywords or regex. regexp kwrds keys kwrds_exclude partial partial_exclude verbose case_sensitive A regular expression A list of key words to look for A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True,} A list of words to exclude from search If kwrds is used, this decides if it should find all or any of them when searching. If kwrds_exclude is used, this decides if it will exclude all or any of them when searching If True, prints out results, otherwise just returns them. If True, an exact match is required to return results. Arguments fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+") Your search returned 20 results… feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'} feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'} ... Example Calling the feature store with "fs", a command could be: With a returned result of… The find method searches through all features given a set of criteria and returns any matches within the name or metadata of columns. It is a great tool to explore the data without pulling in massive datasets. Value to Data Scientist Explore what features are in the feature store via metadata and leverage metadata to enforce governance (e.g., no PI, 3rd party data, etc. as needed) SDK for Feature Store
  • 15. date *features Return features given a specific date or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"} Feature names as strings Arguments dataframe_name = fs.select( "latest", # Give a date {"=": "2021-05-01"} or "latest" for the newest available features "feature_name_last_0-30_days_prior", "feature_name_last_31-60_days_prior", "feature_name_next_1-30_days" # List the features you want ) display(dataframe_name) Example Calling the feature store with "fs", a command could be: With a returned result of… The select method will return a dataframe of all selected features with the given date. select() When you know exactly the features you want customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days 12345 2021-05-01 0.43 0.32 0.21 23456 2021-05-01 0.99 0.94 0.98 Consistent way of selecting the same feature set from the feature store – consistent in dev and when deployed in production Value to Data Scientist Consistent way of selecting (in dev and prod) the same feature set from the feature store when creating a dataframe SDK for Feature Store
  • 16. customer_id as_of feature_name_1 feature_name_qwerty_1 … 12345 2021-05-01 0.43 0.32 … 23456 2021-05-01 0.99 0.94 … select_by() Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex date regexp kwrds keys kwrds_exclude partial partial_exclude case_sensitive Return features given a specific date or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"} A regular expression A list of key words to look for A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True,} A list of words to exclude from search If kwrds is used, this decides if it should find all or any of them when searching. If kwrds_exclude is used, this decides if it will exclude all or any of them when searching If True, an exact match is required to return results. Arguments dataframe_name = fs.select_by({"=": "2021-05-01"}, regexp="^(?=.*asdf)(?=.*qwerty).+") display(dataframe_name) Example Calling the feature store with "fs", a command could be: With a returned result of… The select_by method searches through all features given a set of criteria and returns a dataframe including all the features that match the criteria within the name or metadata. Value to Data Scientist Consistent way of exploring the feature store and leveraging metadata for selection while simultaneously creating a dataframe with the selected features SDK for Feature Store
  • 17. Gold BI Consumption: Dimensional Model Bronze Silver ML Consumption: Feature Store The Delta Lake Optional: Consumption Optimized Databases ETL ETL Low Latency Memory Cache High Concurrency Data Warehouse Mirror Mirror Implementation on the Data Lake
  • 18. Bronze Silver ML Consumption: Feature Store The Delta Lake Optional: Consumption Optimized Databases ETL ETL Low Latency Memory Cache Mirror SDK (Data Access Layer) • Consistent view of “online” and “historic” features. • Separation of logical and physical models. • Metadata focused query interface for data science exploration. Historic Feature Queries Online Point Reads Implementation on the Data Lake
  • 19. § Simplifies "point in time joins". § Not as flexible or timely. Pre-defined time aggregations "As Of" Granularity "Dynamic Point in Time Joins" Demonstrated by Feast More flexible, improved timeliness. Multiple feature tables Technically possible to use a single wide table. § Simplifies: ▪ Schema Migration ▪ Query Planning & Optimization ▪ Scheduling Physical Feature Tables Two Choices
  • 20. Summary 1 Feature stores accelerate data science & enable better governance. 2 Most design complexity stems from machine learning requirements for point in time accurate data. 3 80/20 solutions possible by carefully considering “online” requirements.
  • 21. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
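
The slides describe the SDK only through its arguments and example calls, so the following is a minimal in-memory sketch of the ideas rather than the presenters' implementation. It assumes two hypothetical feature tables keyed by (customer_id, as_of), and a hypothetical `forward_looking` metadata tag standing in for the "code-embedded metadata" that marks future-facing windows. `ToyFeatureStore`, `FeatureTable`, and the reduced `find()`/`select()` signatures are illustrative names, not the real SDK.

```python
import re
from dataclasses import dataclass


@dataclass
class FeatureTable:
    """One physical feature table (hypothetical layout for this sketch)."""
    rows: dict      # {(customer_id, as_of): {feature_name: value}}
    metadata: dict  # {feature_name: {"comment": str, "forward_looking": bool}}


class ToyFeatureStore:
    """Illustrative stand-in for the SDK; not the presenters' code."""

    def __init__(self, tables):
        self.tables = list(tables)

    def find(self, regexp=None, keys=None):
        """Return {feature_name: metadata} for features matching the criteria."""
        pattern = re.compile(regexp) if regexp else None
        hits = {}
        for table in self.tables:
            for name, meta in table.metadata.items():
                # Search the feature name and its comment together.
                text = f"{name} {meta.get('comment', '')}"
                if pattern is not None and not pattern.search(text):
                    continue
                # Every requested metadata tag must match exactly.
                if keys and any(meta.get(k) != v for k, v in keys.items()):
                    continue
                hits[name] = meta
        return hits

    def select(self, as_of, *features):
        """Join the requested features across tables on (customer_id, as_of)."""
        joined = {}
        for table in self.tables:
            wanted = [f for f in features if f in table.metadata]
            if not wanted:
                continue
            for (entity, date), values in table.rows.items():
                if date != as_of:
                    continue
                row = joined.setdefault(entity, {"customer_id": entity, "as_of": date})
                for f in wanted:
                    row[f] = values[f]
        return [joined[e] for e in sorted(joined)]


# Sample values copied from the example table on the slides.
windows = FeatureTable(
    rows={
        (12345, "2021-05-01"): {"feature_name_last_0-30_days_prior": 0.43,
                                "feature_name_next_1-30_days": 0.21},
        (23456, "2021-05-01"): {"feature_name_last_0-30_days_prior": 0.99,
                                "feature_name_next_1-30_days": 0.98},
    },
    metadata={
        "feature_name_last_0-30_days_prior": {
            "comment": "Value over the 0-30 days before as_of.",
            "forward_looking": False},
        "feature_name_next_1-30_days": {
            "comment": "Value over the 1-30 days after as_of (target window).",
            "forward_looking": True},
    },
)
older = FeatureTable(
    rows={
        (12345, "2021-05-01"): {"feature_name_last_31-60_days_prior": 0.32},
        (23456, "2021-05-01"): {"feature_name_last_31-60_days_prior": 0.94},
    },
    metadata={
        "feature_name_last_31-60_days_prior": {
            "comment": "Value over the 31-60 days before as_of.",
            "forward_looking": False},
    },
)

fs = ToyFeatureStore([windows, older])

# Keep only backward-facing windows as independent variables (no leakage).
independent = sorted(fs.find(keys={"forward_looking": False}))
training_rows = fs.select("2021-05-01", *independent)
```

Because the forward-facing window is excluded through metadata rather than by hand, the same selection can be repeated unchanged in development and in production, which is the governance point the slides make.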