The feature store is a data architecture concept used to accelerate data science experimentation and harden production ML deployments. Nate Buesgens and Bryan Christian describe a practical approach to building a feature store on Delta Lake at a large financial organization. This implementation has reduced feature engineering “wrangling” time by 75% and has increased the rate of production model delivery by 15x. The approach described focuses on practicality: it is informed by innovative approaches such as Feast, but the primary goal is evolutionary extension of existing patterns that can be applied to any Delta Lake architecture.
Key Takeaways:
– Understand the key use cases that motivate the feature store from both a data science and engineering perspective.
– Consider edge cases where there may be opportunities for simplification such as “online” predictions.
– Review a typical logical data model for a feature store and how that can be applied to your business domain.
– Consider options for physical storage of the feature store in the Delta Lake.
– Understand common access patterns including metadata-based feature discovery.
2. Agenda
§ What is a Feature Store?
▪ MLOps for Acceleration and Governance in the Enterprise
▪ Feature Store: Use Cases
▪ Edge Cases: 80/20
▪ Relation to the Data Warehouse
§ Design Reference
▪ Logical Data Model & Access Patterns
▪ Physical Representation in the Delta Lake
4. 75%: Reduction in Feature Engineering “Data Wrangling” Time
15X: Accelerated Model Delivery with MLOps Automation and Governance
END-TO-END VALUE DELIVERY · TIME TO VALUE & CONCURRENCY · SCALABLE INFRASTRUCTURE
I.E. AVOID: “PROOF OF CONCEPT FACTORY”
MLOps: Data Science at Scale
5. [Diagram: feature engineering is the bottleneck between curated data and modelling; with a feature store, many concurrent modelling efforts consume the same curated features.]
The feature store serves as the consumption layer for ML applications. It provides:
• Acceleration: pre-“hardened” features reduce data wrangling time for the Data Scientist.
• Governance: a common consumption pattern ensures nothing is lost in the translation to production.
Example: Feature Store
Infrastructure to support DS + MLE
6. The Feature Store is built on the following data science requirements, which are relevant to predictive analytics in Financial Services use cases:
▪ Correct and consistently applied joins across multiple Delta files without loss of processing speed.
▪ Aggregations, window functions, and transformations of data.
▪ Granularity of point in time and level of the prediction (e.g. individual, account, etc.).

customer_id | as_of      | feature_name_last_0-30_days_prior | feature_name_last_31-60_days_prior | feature_name_next_1-30_days
12345       | 2021-05-01 | 0.43                              | 0.32                               | 0.21
23456       | 2021-05-01 | 0.99                              | 0.94                               | 0.98
34567       | 2021-05-01 | 0.03                              | 0.92                               | 0.13
45678       | 2021-05-01 | 0.42                              | 0.59                               | 0.50

The Feature Store uses the “as_of” date as the point-in-time granularity for both backward- and forward-facing windows. Code-embedded metadata allows easy removal of forward-facing windows from the “independent” variables to prevent feature leakage.
Data Science Use Cases
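The leakage-prevention idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `window` metadata tag and the helper name are assumptions, while the feature names follow the slide's example.

```python
# Hypothetical metadata: each feature carries a tag recording whether its
# aggregation window looks backward or forward from the as_of date.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior":  {"window": "backward"},
    "feature_name_last_31-60_days_prior": {"window": "backward"},
    "feature_name_next_1-30_days":        {"window": "forward"},
}

def independent_variables(features, metadata=FEATURE_METADATA):
    """Drop forward-facing windows so future information cannot leak into
    the model's inputs; they remain available as targets."""
    return [f for f in features if metadata[f]["window"] != "forward"]

print(independent_variables(list(FEATURE_METADATA)))
```

Because the tag lives in code-embedded metadata rather than in a naming convention alone, the same filter can be applied consistently in development and production.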
7. § Many ML use cases don’t have an online requirement, esp. “Human + AI”.
§ Extending the MVP:
▪ Some online use cases can be reframed as streaming use cases.
▪ Online use cases can be met with extensions to the Delta Lake design.
▪ See: feast.dev
§ Low-code & citizen science expands the user base, but doesn’t necessarily accelerate existing users.
§ 80/20 value from: Optimizing Access vs. Optimizing ETL Development
“Online” Features: Ultra-Low-Latency, Ultra-Timely Point Reads
Low-Code ETL: Configuration Based, AutoML, FeatureFlow, etc.
Edge Cases
Opportunities to Simplify for an 80/20 Feature Store MVP
8. Similarities:
▪ “Golden” aggregates of curated data.
▪ Highly structured, well-defined granularities (esp. as an 80/20 solution).
▪ Similar non-functional requirements for strong governance standards, metadata management, discovery, etc.
Differences:
▪ Different use case: BI vs. Modelling.
▪ Different access patterns, therefore a different data model and a different technology stack.
▪ Supervised learning creates complex requirements for “point in time accurate data”.
Comparison with Data Warehouse
i.e. Dimensional Model
11. (Reference: Structured Streaming Programming Guide)
1. WINDOW FUNCTIONS
2. WATERMARK
3. FEATURE LEAKAGE
Point in Time Accurate Data
Three Ways Inconsistency Sneaks In
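Point-in-time accuracy mostly comes down to bounding every backward window by the as_of date. A toy sketch of the idea, with a hypothetical event log and helper (values are illustrative, not from the talk):

```python
from datetime import date, timedelta

# Toy event log: (customer_id, event_date, amount).
EVENTS = [
    (12345, date(2021, 4, 10), 1.0),
    (12345, date(2021, 4, 25), 2.0),
    (12345, date(2021, 5, 3),  5.0),  # later than the as_of date used below
]

def spend_last_30_days(customer_id, as_of):
    """Backward-facing window bounded by as_of: events after as_of are
    excluded, so the feature is reproducible for any historical date."""
    lo = as_of - timedelta(days=30)
    return sum(amount for cid, day, amount in EVENTS
               if cid == customer_id and lo < day <= as_of)

print(spend_last_30_days(12345, date(2021, 5, 1)))
```

An unbounded aggregation over the same log would silently include the May 3 event, which is exactly the kind of inconsistency that sneaks in through careless window functions.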
12. The “Entity”: the thing being modelled. (Term borrowed from Feast.)
Granularity, “As of”: every feature for an entity “as of” a date.
§ Discrete granularity (daily, hourly, etc.), not an “event time”.
§ 80/20 solution; for “continuous” granularity see: Feast.
Columns:
§ Features: un-vectorized (80/20).
§ Targets: necessarily at the same granularity as features.
§ Predictions: one model’s prediction is often another’s feature.
Feature Store Logical Model
Data Model for Feature Store Access
13. The SDK indexes the available features and, upon request, builds the joins to combine all desired features into one cohesive data frame, providing a production-grade feature selection tool.
▪ No need to rebuild the whole feature store when new features are added. (Certain sets of features might be rebuilt at times, though with far shorter downtime.)
▪ Keyword searching is enabled for features, so you can find any feature you’re looking for using “human” logic.
▪ Tuning can be specific to each set of features, allowing more optimal feature creation.
Core Functionality:
▪ find(): search through all columns and metadata for the features you want to use by giving keys, keywords, or regex.
▪ select(): when you know exactly the features you want.
▪ select_by(): select columns and return a dataframe by giving a date, keys, keywords, or regex.
SDK for Feature Store
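The join-building idea can be sketched without Spark. In this hypothetical stand-in, each feature set lives in its own table at the shared (customer_id, as_of) grain, and the SDK merges the requested tables into one cohesive set of rows; table names and values are illustrative only.

```python
# Two independently maintained feature tables, keyed by (entity, as_of).
TABLE_A = {(12345, "2021-05-01"): {"feature_a": 0.43}}
TABLE_B = {(12345, "2021-05-01"): {"feature_b": 0.32}}

def build_frame(*tables):
    """Outer-join feature tables on the shared (customer_id, as_of) key."""
    rows = {}
    for table in tables:
        for key, cols in table.items():
            rows.setdefault(key, {}).update(cols)
    return [{"customer_id": cid, "as_of": as_of, **cols}
            for (cid, as_of), cols in sorted(rows.items())]

print(build_frame(TABLE_A, TABLE_B))
```

Because tables are joined on demand, a new feature table can be added without rebuilding the rest of the store, which is the point made above.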
14. find(): To search through all columns and metadata for the features you want to use by giving keys, keywords, or regex.
Arguments:
▪ regexp: a regular expression.
▪ kwrds: a list of key words to look for.
▪ keys: a dictionary of str: any pointing to tags in the metadata of features, i.e. {"model_output": True}.
▪ kwrds_exclude: a list of words to exclude from search.
▪ partial: if kwrds is used, this decides whether all or any of them must match when searching.
▪ partial_exclude: if kwrds_exclude is used, this decides whether it will exclude all or any of them when searching.
▪ verbose: if True, prints out results; otherwise just returns them.
▪ case_sensitive: if True, an exact match is required to return results.
Example. Calling the feature store with “fs”, a command could be:
fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+")
With a returned result of:
Your search returned 20 results…
feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'}
feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'}
...
The find method searches through all features given a set of criteria and returns any matches within the name or metadata of columns. It is a great tool to explore the data without pulling in massive datasets.
Value to Data Scientist: explore what features are in the feature store via metadata, and leverage metadata to enforce governance (e.g., no PI, 3rd party data, etc. as needed).
SDK for Feature Store
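A minimal sketch of the metadata search behind find(). The catalog dict here is a hypothetical stand-in for the SDK's index over Delta column metadata, and only a subset of the arguments is shown:

```python
import re

# Hypothetical feature catalog: feature name -> metadata tags.
CATALOG = {
    "feature_name_1": {"comment": "Flag if asdf > 0.3 at any point within the last 3 months."},
    "feature_name_qwerty_1": {"comment": "Average number of widgets customer purchased in the last 0-1 months."},
}

def find(regexp=None, kwrds=None, keys=None):
    """Match feature names and metadata against the criteria, without
    reading any feature data."""
    results = {}
    for name, meta in CATALOG.items():
        haystack = name + " " + " ".join(str(v) for v in meta.values())
        if regexp and not re.search(regexp, haystack):
            continue
        if kwrds and not all(w.lower() in haystack.lower() for w in kwrds):
            continue
        if keys and not all(meta.get(k) == v for k, v in keys.items()):
            continue
        results[name] = meta
    return results

print(find(kwrds=["widgets"]))
```

Because the search touches only names and metadata, exploration stays cheap regardless of how large the underlying feature tables are.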
15. select(): When you know exactly the features you want.
Arguments:
▪ date: return features given a specific date, or use "latest" to return the last updated feature date. For specific dates, include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}.
▪ *features: feature names as strings.
Example. Calling the feature store with “fs”, a command could be:
dataframe_name = fs.select(
    "latest",  # Give a date {"=": "2021-05-01"} or "latest" for the newest available features
    "feature_name_last_0-30_days_prior",
    "feature_name_last_31-60_days_prior",
    "feature_name_next_1-30_days",  # List the features you want
)
display(dataframe_name)
With a returned result of:
customer_id | as_of      | feature_name_last_0-30_days_prior | feature_name_last_31-60_days_prior | feature_name_next_1-30_days
12345       | 2021-05-01 | 0.43                              | 0.32                               | 0.21
23456       | 2021-05-01 | 0.99                              | 0.94                               | 0.98
The select method will return a dataframe of all selected features for the given date.
Value to Data Scientist: a consistent way of selecting the same feature set from the feature store when creating a dataframe, consistent in dev and when deployed in production.
SDK for Feature Store
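The date argument's two forms ("latest" vs. an operator dict) can be sketched as follows. This is an illustrative stand-in, not the SDK source; the rows, feature name `f1`, and helper are all hypothetical:

```python
import operator

OPERATORS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
             "<": operator.lt, "<=": operator.le}

# Toy feature rows at the (customer_id, as_of) grain. ISO date strings
# compare correctly as plain strings, so no parsing is needed here.
ROWS = [
    {"customer_id": 12345, "as_of": "2021-04-01", "f1": 0.40},
    {"customer_id": 12345, "as_of": "2021-05-01", "f1": 0.43},
]

def select(date_spec, *features):
    """date_spec is "latest" or a one-entry dict like {">": "2021-04-15"}."""
    if date_spec == "latest":
        latest = max(row["as_of"] for row in ROWS)
        matched = [row for row in ROWS if row["as_of"] == latest]
    else:
        (op, day), = date_spec.items()
        matched = [row for row in ROWS if OPERATORS[op](row["as_of"], day)]
    return [{"customer_id": row["customer_id"], "as_of": row["as_of"],
             **{f: row[f] for f in features}} for row in matched]

print(select("latest", "f1"))
```

Pinning the date spec in code is what makes the selection reproducible between development and production.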
16. select_by(): Selecting columns and returning a dataframe by giving a date, keys, keywords, or regex.
Arguments:
▪ date: return features given a specific date, or use "latest" to return the last updated feature date. For specific dates, include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}.
▪ regexp: a regular expression.
▪ kwrds: a list of key words to look for.
▪ keys: a dictionary of str: any pointing to tags in the metadata of features, i.e. {"model_output": True}.
▪ kwrds_exclude: a list of words to exclude from search.
▪ partial: if kwrds is used, this decides whether all or any of them must match when searching.
▪ partial_exclude: if kwrds_exclude is used, this decides whether it will exclude all or any of them when searching.
▪ case_sensitive: if True, an exact match is required to return results.
Example. Calling the feature store with “fs”, a command could be:
dataframe_name = fs.select_by({"=": "2021-05-01"}, regexp="^(?=.*asdf)(?=.*qwerty).+")
display(dataframe_name)
With a returned result of:
customer_id | as_of      | feature_name_1 | feature_name_qwerty_1 | …
12345       | 2021-05-01 | 0.43           | 0.32                  | …
23456       | 2021-05-01 | 0.99           | 0.94                  | …
The select_by method searches through all features given a set of criteria and returns a dataframe including all the features that match the criteria within the name or metadata.
Value to Data Scientist: a consistent way of exploring the feature store and leveraging metadata for selection while simultaneously creating a dataframe with the selected features.
SDK for Feature Store
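Conceptually, select_by is find-then-select: filter the catalog by name and metadata, then materialize the matching columns for the requested date. A self-contained sketch with hypothetical catalog entries and rows (only the "=" operator is handled, for brevity):

```python
import re

# Hypothetical catalog (name -> metadata) and feature rows.
CATALOG = {
    "feature_name_1": {"comment": "Flag if asdf exceeded the threshold."},
    "feature_name_qwerty_1": {"comment": "Average widgets purchased, asdf adjusted."},
}
ROWS = [{"customer_id": 12345, "as_of": "2021-05-01",
         "feature_name_1": 0.43, "feature_name_qwerty_1": 0.32}]

def select_by(date_spec, regexp=None):
    """date_spec is a one-entry dict like {"=": "2021-05-01"}."""
    # Step 1 (find): match names + metadata against the regex.
    names = [n for n, meta in CATALOG.items()
             if regexp is None or re.search(regexp, n + " " + meta["comment"])]
    # Step 2 (select): materialize matching columns for the given date.
    (_, day), = date_spec.items()
    return [{"customer_id": r["customer_id"], "as_of": r["as_of"],
             **{n: r[n] for n in names}}
            for r in ROWS if r["as_of"] == day]

print(select_by({"=": "2021-05-01"}, regexp="^(?=.*asdf)(?=.*qwerty).+"))
```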
17. [Diagram: The Delta Lake flows Bronze → Silver → Gold via ETL. Gold serves BI consumption via a Dimensional Model, optionally mirrored to consumption-optimized databases (a high-concurrency Data Warehouse). The Feature Store serves ML consumption, optionally mirrored to a low-latency memory cache.]
Implementation on the Data Lake
18. [Diagram: the SDK sits in front of the Feature Store (ML consumption on the Delta Lake, Bronze → Silver via ETL) and its optional low-latency memory-cache mirror, serving both historic feature queries and online point reads.]
SDK (Data Access Layer):
• Consistent view of “online” and “historic” features.
• Separation of logical and physical models.
• Metadata-focused query interface for data science exploration.
Implementation on the Data Lake
19. Choice 1: “As Of” Granularity (pre-defined time aggregations)
§ Simplifies “point in time joins”.
§ Not as flexible or timely.
§ Alternative: “Dynamic Point in Time Joins”, demonstrated by Feast; more flexible, with improved timeliness.
Choice 2: Multiple feature tables (technically possible to use a single wide table)
§ Simplifies:
▪ Schema Migration
▪ Query Planning & Optimization
▪ Scheduling
Physical Feature Tables
Two Choices
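The "dynamic point in time join" alternative can be illustrated in miniature: rather than pre-aggregating features to an as_of grain, pick the most recent feature value whose timestamp does not exceed the requested as_of. This is a toy sketch of the pattern Feast demonstrates, with hypothetical rows and helper names:

```python
from datetime import date

# Feature rows keyed by entity, each carrying an update timestamp.
FEATURE_ROWS = [
    (12345, date(2021, 4, 20), 0.40),
    (12345, date(2021, 4, 30), 0.43),
    (12345, date(2021, 5, 2),  0.99),  # newer than the as_of requested below
]

def point_in_time_value(entity_id, as_of):
    """Return the latest value at or before as_of; never look into the future."""
    candidates = [(ts, value) for eid, ts, value in FEATURE_ROWS
                  if eid == entity_id and ts <= as_of]
    return max(candidates)[1] if candidates else None

print(point_in_time_value(12345, date(2021, 5, 1)))
```

The trade-off noted above is visible here: this join is more flexible and timely than a pre-aggregated as_of table, but the store must scan candidate rows per request instead of reading one pre-built row.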
20. Summary
1. Feature stores accelerate data science & enable better governance.
2. Most design complexity stems from machine learning requirements for point in time accurate data.
3. 80/20 solutions are possible by carefully considering “online” requirements.