Building a Real-Time Feature Store at iFood
2. Building a Real-Time Feature Store at iFood
Daniel Galinkin
ML Platform Tech Lead
3. Agenda
▪ iFood and AI: what the iFood mission is, and how we use AI
▪ What is a Feature Store: what a Feature Store is, and why it is important for solving AI problems
▪ How iFood built its Feature Store: by leveraging Spark, Databricks, and Delta Tables
5. Biggest foodtech in Latin America (we're in Brazil, Mexico, and Colombia)
▪ ~30 million orders per month
▪ 800+ cities, across all Brazilian states
▪ 100,000+ restaurants
6. AI Everywhere
Discovery
▪ Restaurant recommendations
▪ Dish recommendations
Logistics
▪ Optimize driver allocation
▪ Estimate the delivery time
▪ Find the most efficient route
Marketing
▪ Optimize the use of marketing ads
▪ Optimize the use of coupons
9. What are features?
▪ Any kind of data used to train an ML model
▪ Feature types:
▪ State features: did the user have a coupon at the time?
▪ Aggregate features: average ticket price in the last 30 days for the user
▪ External features: was it raining at the time?
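A minimal sketch, with illustrative names that are not iFood's schema, of how these three feature kinds could be modeled as data:

sealed trait Feature { def entityId: String; def name: String }

// State feature: a fact about the entity at a point in time
case class StateFeature(entityId: String, name: String, value: Boolean) extends Feature

// Aggregate feature: a value computed over a window of events
case class AggregateFeature(entityId: String, name: String, value: Double, windowDays: Int) extends Feature

// External feature: data joined in from outside the company's own event streams
case class ExternalFeature(entityId: String, name: String, value: String) extends Feature

val examples = Seq(
  StateFeature("user-42", "HadCouponAtOrderTime", value = true),
  AggregateFeature("user-42", "AvgTicketPrice", value = 25.8, windowDays = 30),
  ExternalFeature("order-7", "WeatherAtOrderTime", "raining"))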
10. What is a feature store?
▪ The feature store is the central place in an organization to query for features
▪ Features are mostly used by machine learning algorithms
▪ They can also be useful for other applications
▪ For example, you could use the average ticket price for a user to show a high-end or low-end list of restaurants
11. Feature store requirements
▪ General:
▪ Low latency (access & calculation)
▪ Access control
▪ Versioning
▪ Scalability
▪ Easy API for data access
▪ Machine Learning:
▪ Backfilling
▪ "Time travel": snapshots of historical feature values
13. iFood Software Architecture
Streaming as a first-class citizen
[Diagram: the Orders, Payments, Fleet location, Sessions, Coupons, and Notifications microservices publish to a real-time events central bus, which feeds the Real-time Data Lake, the Feature Store, and the Aggregation Service]
14. iFood Real-time Data Lake Architecture
▪ Kafka storage is expensive
▪ Retention is limited
▪ Full event history enables recalculation and backfilling for features
▪ Delta tables provide a cheap storage option
▪ Delta tables can double as either batch or streaming sources
[Diagram: realtime events central bus feeding the data lake streaming jobs, which write the data lake streaming Delta table]
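A short sketch of that last point, assuming an active SparkSession and a hypothetical table path: the same Delta table serves as a bounded batch source and as an unbounded streaming source.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Batch read: the full event history, e.g. to recalculate or backfill a feature
val orderHistory = spark.read.format("delta").load("/data-lake/orders")

// Streaming read: the same table as a continuous source for downstream jobs
val orderStream = spark.readStream.format("delta").load("/data-lake/orders")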
15. iFood Feature Store Architecture
[Diagram: the aggregation jobs consume the Kafka bus and the data lake streaming Delta table, while the historic backfilling jobs read the same Delta table using feature metadata stored in DynamoDB; their outputs flow through the real-time materialization job into the real-time Redis storage and through the historic materialization job into the historic Delta table storage]
16. iFood Feature Store Architecture
The aggregation jobs
[Same diagram as slide 15, highlighting the aggregation jobs fed by the Kafka bus and the data lake streaming Delta table]
17. iFood Feature Store Architecture
The aggregation jobs
▪ Features are usually combinations of (modeled in the sketch after this slide):
▪ Source: orders stream
▪ Window range: last 30 days
▪ Grouping key: by each user
▪ Value: ticket price
▪ Filter: during lunch
▪ Aggregation type: average
[Diagram: aggregation jobs consuming the Kafka bus and the data lake streaming Delta table]
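A minimal sketch of such a combination expressed as data; the field names and example values are illustrative, not iFood's actual schema.

case class FeatureDefinition(
  source: String,              // e.g. the orders stream
  windowRangeDays: Int,        // e.g. 30 for "last 30 days"
  groupingKey: String,         // e.g. "user_id" for "by each user"
  valueColumn: String,         // e.g. "ticket_price"
  filterExpr: Option[String],  // e.g. a "during lunch" predicate
  aggregationType: String)     // e.g. "avg"

val avgLunchTicket30Days = FeatureDefinition(
  source = "orders",
  windowRangeDays = 30,
  groupingKey = "user_id",
  valueColumn = "ticket_price",
  filterExpr = Some("hour(order_ts) BETWEEN 11 AND 14"),
  aggregationType = "avg")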
18. iFood Feature Store Architecture
The aggregation jobs
▪ With Spark Streaming, you can only execute one groupBy operation per dataframe/job
▪ Each combination of grouping key and window range results in a new dataframe
▪ That means increased costs and operational complexity

ordersStreamDF
  .groupBy(col("user_id"), window(col("order_ts"), "1 day"))
  .agg(sum("ticket"))

ordersStreamDF
  .groupBy(col("user_id"), window(col("order_ts"), "3 days"))
  .agg(sum("ticket"))

ordersStreamDF
  .groupBy(col("user_id"), window(col("order_ts"), "7 days"))
  .agg(sum("ticket"))
19. iFood Feature Store Architecture
The aggregation jobs
▪ We store the intermediate state for several aggregation types over a fixed, smaller window
▪ We then combine those intermediate results to emit results for several window sizes at once
▪ This also allows us to use the same code and the same job to calculate historical and real-time features
20. iFood Feature Store Architecture
The aggregation jobs - Two-step aggregation logic
Orders streaming source, bucketed into 1-day windows (the stored intermediate state):
D-6: 1, D-5: 2, D-4: 3, D-3: 0, D-2: 1, D-1: 1, D-0: 2
3-day windows, combined from the daily buckets:
D-6 to D-4: 6, D-5 to D-3: 5, D-4 to D-2: 4, D-3 to D-1: 2, D-2 to D-0: 4
7-day window:
D-6 to D-0: 10
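The same combination step in plain Scala, using the daily sums from the diagram above: the stored intermediate state is the 1-day buckets, and the larger windows are derived from them on emit.

val dailySums = Seq(1, 2, 3, 0, 1, 1, 2)                  // D-6 .. D-0

// 3-day windows: sliding sums over the daily buckets
val threeDayWindows = dailySums.sliding(3).map(_.sum).toSeq
// => Seq(6, 5, 4, 2, 4)  (D-6..D-4, D-5..D-3, D-4..D-2, D-3..D-1, D-2..D-0)

// 7-day window: the sum of all seven buckets
val sevenDayWindow = dailySums.sum                         // => 10 (D-6..D-0)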
21. iFood Feature Store Architecture
The aggregation jobs
▪ How to express that?
▪ flatMapGroupsWithState
▪ Flexibility in storing state and expressing calculation logic
▪ That allows us to combine dozens of jobs into one

// Pseudocode outlining the combined aggregation job
def combineAggregations(
    sourceDF: DataFrame,
    groupByKeys: Seq[String],
    windowStep: Long,
    combinationRules: Seq[CombinationRule]): DataFrame = {
  putStateAndOutputPlaceholdersToFitCombinedSchema(sourceDF)
    .groupByKey(row => combineGroupKeys(row, groupByKeys))
    .flatMapGroupsWithState((key, miniBatchIterator, state) => {
      miniBatchIterator.foreach(row => {
        // First step: keep per-step intermediate values in the managed state
        if (inputWindowEnd(row) > newestOutputWindowEnd(state)) {
          moveStateRangeForward(state)
        }
        if (inputRowIsInStateRange(row, state)) {
          firstStepUpdateIntermediateValue(row, state)
        }
      })
      // Second step: combine the intermediate values into every requested window size
      combinationRules.foreach(combinationRule => {
        secondStepCalculateFinalResultBasedOnIntermediateValues(combinationRule, state)
      })
      yieldAnOutputRowBasedOnTheResults(state)
    })
}
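For reference, a self-contained sketch of the same pattern that compiles against Spark Structured Streaming. It is deliberately simplified: the state is just a map of daily ticket sums per user, and the source path and Order schema are assumptions, not iFood's generic combination-rule engine.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Order(userId: String, ticket: Double, orderTs: Timestamp)
case class DailyBuckets(sums: Map[Long, Double])                   // epoch day -> summed ticket
case class FeatureRow(userId: String, featureName: String, featureValue: Double)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Streaming source; path and schema are hypothetical
val orders: Dataset[Order] = spark.readStream
  .format("delta").load("/data-lake/orders").as[Order]

def emitWindows(userId: String,
                rows: Iterator[Order],
                state: GroupState[DailyBuckets]): Iterator[FeatureRow] = {
  // First step: fold the new rows into per-day intermediate sums kept in state
  val previous = state.getOption.getOrElse(DailyBuckets(Map.empty))
  val updated = rows.foldLeft(previous) { (acc, order) =>
    val day = order.orderTs.getTime / 86400000L
    DailyBuckets(acc.sums.updated(day, acc.sums.getOrElse(day, 0.0) + order.ticket))
  }
  state.update(updated)
  if (updated.sums.isEmpty) Iterator.empty
  else {
    // Second step: combine the daily buckets into several window sizes at once
    val newestDay = updated.sums.keys.max
    def windowSum(days: Int): Double =
      updated.sums.collect { case (day, sum) if day > newestDay - days => sum }.sum
    Iterator(
      FeatureRow(userId, "TicketSum1Day", windowSum(1)),
      FeatureRow(userId, "TicketSum3Days", windowSum(3)),
      FeatureRow(userId, "TicketSum7Days", windowSum(7)))
  }
}

val features = orders
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(emitWindows)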
22. iFood Feature Store Architecture
The aggregation jobs
Input (orders source):
Order ID | Customer ID | Date       | ...
...      | 1           | 2020-01-01 | ...

Output (feature rows):
Entity   | Entity ID | Date       | Feat. Name   | Feat. Value
Customer | 1         | 2020-01-01 | NOrders1Day  | 2
Customer | 1         | 2020-01-01 | NOrders3Days | 6
Customer | 1         | 2020-01-01 | NOrders7Days | 10
23. iFood Feature Store Architecture
The materialization jobs
[Same diagram as slide 15, highlighting the materialization jobs: the real-time materialization job writes to the Redis storage and the historic materialization job writes to the historic Delta table storage, both fed from the Kafka bus]
24. iFood Feature Store Architecture
The materialization jobs
▪ Feature update commands are stored to a Kafka topic; think CDC or log tailing
▪ "Update feature F for entity E at row R with value V"
▪ Using the Delta table storage, we use MERGE INTO and the map_concat function to stay flexible (sketched after the example tables below)
Update command: Entity=Customer, Entity ID=1, Date=2020-01-01, Feat. Name=AvgTicketPrice30Days, Feat. Value=25.8
Merged table:
Entity   | Entity ID | Date       | Features Map
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days -> 25.8

Update command: Entity=Customer, Entity ID=2, Date=2020-02-01, Feat. Name=NOrders30Days, Feat. Value=17
Merged table:
Entity   | Entity ID | Date       | Features Map
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days -> 25.8
Customer | 2         | 2020-02-01 | NOrders30Days -> 17

Update command: Entity=Customer, Entity ID=1, Date=2020-01-01, Feat. Name=NOrders30Days, Feat. Value=3
Merged table:
Entity   | Entity ID | Date       | Features Map
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days -> 25.8, NOrders30Days -> 3
Customer | 2         | 2020-02-01 | NOrders30Days -> 17
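A hedged sketch of that MERGE INTO + map_concat upsert using the Delta Lake Scala API; the table path, column names, and the duplicate-key policy are assumptions rather than iFood's actual code.

import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

def upsertFeatures(spark: SparkSession, updates: DataFrame): Unit = {
  // map_concat fails on duplicate keys by default in Spark 3;
  // LAST_WIN keeps the newest value when a feature already exists in the map
  spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

  DeltaTable.forPath(spark, "/feature-store/historic")      // hypothetical path
    .as("t")
    .merge(
      updates.as("s"),                                      // columns: entity, entity_id, date, features (a map)
      "t.entity = s.entity AND t.entity_id = s.entity_id AND t.date = s.date")
    .whenMatched()
    .updateExpr(Map("features" -> "map_concat(t.features, s.features)"))
    .whenNotMatched()
    .insertAll()
    .execute()
}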
25. iFood Feature Store Architecture
The materialization jobs
▪ Consumers are free to materialize them into their database of choice
▪ For ML, we use:
▪ A Delta table for historic feature values
▪ A Redis cluster for low-latency real-time access
[Diagram: the Kafka bus feeds the real-time materialization job, which writes to the Redis storage, and the historic materialization job, which writes to the historic Delta table storage]
26. iFood Feature Store Architecture
The backfilling jobs
[Same diagram as slide 15, highlighting the historic backfilling jobs]
27. iFood Feature Store Architecture
The backfilling jobs
▪ How do we calculate features for streaming data registered before the creation of a feature?
▪ Use a metadata database to store the creation time of each feature
▪ Run a backfilling job to create feature values up to the feature creation time
▪ Start the streaming job to emit results from values that arrive after the creation date
(a sketch of this split follows the diagram below)
[Diagram: the historic backfilling jobs read the data lake streaming Delta table, using feature creation metadata from DynamoDB, while the aggregation jobs consume the Kafka bus]
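A minimal sketch of that backfill/streaming split, assuming an active SparkSession, hypothetical paths and column names, and a feature creation time looked up from the metadata store.

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Creation time of the feature, normally read from the metadata store (DynamoDB at iFood)
val featureCreatedAt = Timestamp.valueOf("2020-06-01 00:00:00")

// 1) Backfill: batch-read the full event history from the data lake Delta table and
//    compute the feature for events that happened before the feature existed
val backfillInput = spark.read.format("delta").load("/data-lake/orders")
  .filter($"order_ts" < lit(featureCreatedAt))

// 2) Going forward: stream only the events at or after the creation date,
//    running the same aggregation logic as a streaming job
val streamingInput = spark.readStream.format("delta").load("/data-lake/orders")
  .filter($"order_ts" >= lit(featureCreatedAt))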
28. Lessons learned & best practices
▪ Delta tables double as streaming or batch sources
▪ OPTIMIZE is a must for streaming jobs saving to a Delta table
▪ Either in auto mode, or as a separate process
▪ When starting a brand new job from a streaming Delta table source, the file reading order is not guaranteed
▪ This is even more noticeable after running OPTIMIZE (which you should!)
▪ If the event processing order is important for your job, either use Trigger.Once to process the first historical batch, or process each partition sequentially, in order
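A hedged sketch of the Trigger.Once approach, assuming an active SparkSession and hypothetical paths: drain the existing history of the Delta source as a single batch, then restart the same query without the trigger for continuous processing.

import org.apache.spark.sql.streaming.Trigger

val firstBatch = spark.readStream
  .format("delta")
  .load("/data-lake/orders")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/orders-features")
  .trigger(Trigger.Once())                       // process everything available, then stop
  .start("/feature-store/historic")

firstBatch.awaitTermination()

// Compact the small files the streaming job writes (run periodically, or enable auto-optimize on Databricks)
spark.sql("OPTIMIZE delta.`/feature-store/historic`")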
29. Lessons learned & best practices
▪ flatMapGroupsWithState is really powerful
▪ State management should be handled with care
▪ foreachBatch is really powerful
▪ Please note it can be triggered on an empty DataFrame, though
▪ Be sure to use correct partition pruning when using the MERGE INTO operation
▪ Be careful with parameter changes between job restarts
▪ StreamTest really helps with unit tests, debugging, and raising the bar
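A minimal sketch of the foreachBatch guard implied above, reusing the hypothetical upsertFeatures merge from the materialization slide; paths and names are illustrative.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.getOrCreate()

// Hypothetical stream of feature-update commands
val featureUpdates: DataFrame =
  spark.readStream.format("delta").load("/feature-store/updates")

featureUpdates.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // foreachBatch can be invoked with an empty micro-batch; skip the expensive merge in that case
    if (!batch.isEmpty) {
      // MERGE INTO the historic Delta table; keep a partition-pruning predicate
      // (e.g. on the date column) in the merge condition so only touched partitions are rewritten
      upsertFeatures(spark, batch)
    }
  }
  .option("checkpointLocation", "/checkpoints/materialization")
  .start()
  .awaitTermination()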
30. Positive outcomes
▪ Unified codebase for historical and real-time features: 50% less code
▪ Unified jobs for historical and real-time features: from dozens of jobs to around 10
▪ Huge batch ETL jobs are replaced by much smaller streaming clusters
▪ Though they run 24/7
▪ Delta tables allow for isolation between read and write operations