Building a Real-Time Feature Store at iFood
Daniel Galinkin
ML Platform Tech Lead
Agenda
iFood and AI
What the iFood mission is, and how we use AI
What is a Feature Store
What a Feature Store is, and why it is important for solving AI problems
How iFood built its Feature Store
How iFood built its Feature Store by leveraging Spark, Databricks and Delta Tables
iFood and AI
Biggest foodtech in Latin America (we’re in Brazil, Mexico and Colombia)
▪ ~30 million orders per month
▪ 800+ cities in all Brazilian states
▪ 100+ thousand restaurants
AI Everywhere
Discovery
▪ Restaurant recommendations
▪ Dish recommendations
Logistics
▪ Optimize driver allocation
▪ Estimate the delivery time
▪ Find the most efficient route
Marketing
▪ Optimize the use of marketing ads
▪ Optimize the use of coupons
ML Platform
[Diagram: components of an ML platform: Configuration, Data Collection, Feature Extraction, ML, Data Verification, Analysis Tools, Machine Resource Management, Process Management Tools, Serving Infrastructure, Monitoring]
What is a Feature Store
What are features?
▪ Any kind of data used to train an ML model
▪ Feature types:
▪ State features
▪ Did the user have a coupon at the time?
▪ Aggregate features
▪ Average ticket price in the last 30 days for the user
▪ External features
▪ Was it raining at the time?
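As an illustration of an aggregate feature, here is a minimal sketch of "average ticket price in the last 30 days for the user". The `Order` record, its field names and the inclusive window boundaries are assumptions made for this example, not something taken from the talk.

```scala
import java.time.LocalDate

object AggregateFeatureSketch {
  // Hypothetical order record; field names are illustrative only.
  final case class Order(userId: String, date: LocalDate, ticket: Double)

  // Average ticket price for `userId` over the `days` days ending at `asOf` (inclusive).
  // Returns None when the user has no orders in the window.
  def avgTicket(orders: Seq[Order], userId: String, asOf: LocalDate, days: Int): Option[Double] = {
    val windowStart = asOf.minusDays(days.toLong - 1)
    val tickets = orders.collect {
      case o if o.userId == userId && !o.date.isBefore(windowStart) && !o.date.isAfter(asOf) =>
        o.ticket
    }
    if (tickets.isEmpty) None else Some(tickets.sum / tickets.size)
  }
}
```

Returning an Option keeps "no orders in the window" distinct from an average of zero, which matters when the value is later fed to a model.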
What is a feature store?
▪ The feature store is the central place in an
organization to query for features
▪ Features are mostly used by machine learning
algorithms
▪ They can also be useful for other applications
▪ For example, you could use the average ticket price for a user to show a high-end or low-end list of restaurants
Feature store requirements
▪ General:
▪ Low latency
▪ Access & Calculation
▪ Access control
▪ Versioning
▪ Scalability
▪ Easy API for data access
▪ Machine Learning:
▪ Backfilling
▪ “Time-travel” - snapshot for historical feature values
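The "time-travel" requirement can be pictured as a point-in-time lookup: given the history of a feature, return the value that was known at a past date, so that training data does not leak future information. The sketch below assumes a hypothetical `FeatureValue` layout and an in-memory history; it is not the storage model used by iFood.

```scala
import java.time.LocalDate

object TimeTravelSketch {
  // One historical value of a feature for an entity (hypothetical layout).
  final case class FeatureValue(entityId: String, feature: String, date: LocalDate, value: Double)

  // Latest value of `feature` for `entityId` recorded on or before `asOf`.
  def valueAsOf(
      history: Seq[FeatureValue],
      entityId: String,
      feature: String,
      asOf: LocalDate): Option[Double] =
    history
      .filter(v => v.entityId == entityId && v.feature == feature && !v.date.isAfter(asOf))
      .sortBy(_.date.toEpochDay)
      .lastOption
      .map(_.value)
}
```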
How iFood built its Feature Store
iFood Software Architecture
Streaming as a first-class citizen
[Diagram: Orders, Payments, Fleet location, Sessions, Coupons and Notifications microservices, the Real-time Data Lake, and the Feature Store (Aggregation Service), around a central bus of real-time events]
iFood Real-time Data Lake Architecture
▪ Kafka storage is expensive
▪ Retention is limited
▪ Full event history enables
recalculation and backfilling for
features
▪ Delta tables provide a cheap
storage option
▪ Delta tables can double as either
batch or streaming sources
[Diagram: Realtime events Central Bus, Data Lake Streaming Jobs, Data Lake Streaming Delta Table]
iFood Feature Store Architecture
[Diagram: Kafka Bus, Data Lake Streaming Delta Table, Aggregation Jobs, Historic Backfilling Jobs, DynamoDB Metadata, Real-time Materialization Job, Real-time Redis Storage, Historic Materialization Job, Historic Delta Table Storage]
iFood Feature Store Architecture
The aggregation jobs
▪ Features are usually combinations of:
▪ Source - orders stream
▪ Window range - last 30 days
▪ Grouping key - by each user
▪ Value - ticket price
▪ Filter - during lunch
▪ Aggregation type - average
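The six ingredients listed above can be captured in a small declarative structure. This is only a sketch: `FeatureDefinition`, `Order`, the lunch-hour range (11:00 to 15:00) and the `evaluate` helper are hypothetical names and choices for illustration, and the window range is assumed to have been applied before `evaluate` is called.

```scala
object FeatureDefinitionSketch {
  // Hypothetical order event; field names are illustrative only.
  final case class Order(userId: String, hourOfDay: Int, ticket: Double)

  sealed trait Aggregation
  case object Average extends Aggregation
  case object Sum extends Aggregation
  case object Count extends Aggregation

  final case class FeatureDefinition(
      name: String,
      source: String,               // e.g. the orders stream
      windowDays: Int,              // window range
      groupingKey: Order => String, // e.g. by each user
      value: Order => Double,       // e.g. ticket price
      filter: Order => Boolean,     // e.g. during lunch
      aggregation: Aggregation)

  // Evaluate a definition over orders already restricted to the window range.
  def evaluate(definition: FeatureDefinition, ordersInWindow: Seq[Order]): Map[String, Double] =
    ordersInWindow
      .filter(definition.filter)
      .groupBy(definition.groupingKey)
      .map { case (key, rows) =>
        val values = rows.map(definition.value)
        val result = definition.aggregation match {
          case Average => values.sum / values.size
          case Sum     => values.sum
          case Count   => values.size.toDouble
        }
        key -> result
      }

  // "Average ticket price, per user, during lunch, over the last 30 days".
  val avgLunchTicket30Days: FeatureDefinition = FeatureDefinition(
    name = "AvgLunchTicket30Days",
    source = "orders",
    windowDays = 30,
    groupingKey = _.userId,
    value = _.ticket,
    filter = o => o.hourOfDay >= 11 && o.hourOfDay < 15,
    aggregation = Average)
}
```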
[Diagram: Kafka Bus, Data Lake Streaming Delta Table, Aggregation Jobs]
iFood Feature Store Architecture
The aggregation jobs
▪ With Spark Structured Streaming, you can only execute one group-by operation per DataFrame/job
▪ Each combination of grouping key and window range results in a new DataFrame
▪ That means increased costs and operational complexity
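// One streaming aggregation per window size: three separate DataFrames/jobs
import org.apache.spark.sql.functions.{col, sum, window}

val ticket1Day = ordersStreamDF
  .groupBy(col("user_id"), window(col("order_ts"), "1 day"))
  .agg(sum("ticket"))
val ticket3Days = ordersStreamDF
  .groupBy(col("user_id"), window(col("order_ts"), "3 days"))
  .agg(sum("ticket"))
val ticket7Days = ordersStreamDF
  .groupBy(col("user_id"), window(col("order_ts"), "7 days"))
  .agg(sum("ticket"))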
iFood Feature Store Architecture
The aggregation jobs
▪ We store the intermediate state for several aggregation types over a fixed, smaller window
▪ We then combine these intermediate values to emit the results for several window sizes at once
▪ This also allows us to use the same code and the same job to calculate historical and real-time features
iFood Feature Store Architecture
The aggregation jobs - Two-step aggregation logic
Orders streaming source

Step 1 - intermediate state, 1-day windows:
Day     D-6  D-5  D-4  D-3  D-2  D-1  D-0
Value     1    2    3    0    1    1    2

Step 2 - combined results:
3-day windows
D-6 to D-4: 6
D-5 to D-3: 5
D-4 to D-2: 4
D-3 to D-1: 2
D-2 to D-0: 4
7-day windows
D-6 to D-0: 10
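The two steps can be reproduced in plain Scala with the numbers from this slide (1-day values 1, 2, 3, 0, 1, 1, 2 for D-6 to D-0). This is a simplified sketch that uses counts as the only aggregation type and keeps the state in a `Vector`; the function names are made up for the example and the real job keeps this state inside Spark.

```scala
object TwoStepAggregationSketch {
  // Step 1: fold raw events into one intermediate count per 1-day bucket.
  // `daysAgo` is 0 for D-0, 1 for D-1, and so on; the result is ordered
  // from the oldest bucket, D-(numDays - 1), to the newest, D-0.
  def dailyCounts(daysAgo: Seq[Int], numDays: Int): Vector[Int] = {
    val counts = Array.fill(numDays)(0)
    daysAgo
      .filter(d => d >= 0 && d < numDays) // ignore events outside the state range
      .foreach(d => counts(numDays - 1 - d) += 1)
    counts.toVector
  }

  // Step 2: combine the 1-day buckets into every complete window of `size` days.
  def windowSums(daily: Vector[Int], size: Int): Vector[Int] =
    daily.sliding(size).filter(_.size == size).map(_.sum).toVector
}
```

With the slide's values, `windowSums(Vector(1, 2, 3, 0, 1, 1, 2), 3)` gives `Vector(6, 5, 4, 2, 4)` and a size of 7 gives `Vector(10)`, so one pass over the source feeds every window size.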
iFood Feature Store Architecture
The aggregation jobs
▪ How to express that?
▪ flatMapGroupsWithState
▪ Flexibility on storing state and
expressing calculation logic
▪ That allows us to combine
dozens of jobs into one
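// Pseudocode: the helper calls stand in for logic not shown on the slide.
// The real flatMapGroupsWithState also takes an output mode, a timeout
// configuration, and passes the group key to the function.
def combineAggregations(
    sourceDF: DataFrame,
    groupByKeys: Seq[String],
    windowStep: Long,
    combinationRules: Seq[CombinationRule]): DataFrame = {
  putStateAndOutputPlaceholdersToFitCombinedSchema(sourceDF)
    .groupByKey(row => combineGroupKeys())
    .flatMapGroupsWithState((state, miniBatchIterator) => {
      // First step: update the fixed-size intermediate windows kept in state
      miniBatchIterator.foreach(row => {
        if (inputWindowEnd() > newestOutputWindowEnd()) {
          moveStateRangeForward()
        }
        if (inputRowIsInStateRange()) {
          firstStepUpdateIntermediateValue()
        }
      })
      // Second step: derive each requested window size from the intermediate values
      combinationRules.foreach(combinationRule => {
        secondStepCalculateFinalResultBasedOnIntermediateValues()
      })
      yieldAnOutputRowBasedOnTheResults()
    })
}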
iFood Feature Store Architecture
The aggregation jobs
Input (orders stream), columns: Order ID, Customer ID, Date, ...
Example row: Customer 1, 2020-01-01, ...

Output (feature rows):
Entity   | Entity ID | Date       | Feat. Name   | Feat. Value
Customer | 1         | 2020-01-01 | NOrders1Day  | 2
Customer | 1         | 2020-01-01 | NOrders3Days | 6
Customer | 1         | 2020-01-01 | NOrders7Days | 10
iFood Feature Store Architecture
The materialization jobs
▪ Feature update commands are stored to a Kafka topic - think CDC or log tailing
▪ Update feature F for entity E at row R with value V
▪ Using the Delta Table Storage, we use MERGE INTO and the map_concat function to stay flexible: each update is merged into the row's features map
Update 1:
Entity   | Entity ID | Date       | Feat. Name           | Feat. Value
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days | 25.8
Table after update 1:
Entity   | Entity ID | Date       | Features Map
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days -> 25.8

Update 2:
Entity   | Entity ID | Date       | Feat. Name    | Feat. Value
Customer | 2         | 2020-02-01 | NOrders30Days | 17
Table after update 2:
Entity   | Entity ID | Date       | Features Map
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days -> 25.8
Customer | 2         | 2020-02-01 | NOrders30Days -> 17

Update 3:
Entity   | Entity ID | Date       | Feat. Name    | Feat. Value
Customer | 1         | 2020-01-01 | NOrders30Days | 3
Table after update 3:
Entity   | Entity ID | Date       | Features Map
Customer | 1         | 2020-01-01 | AvgTicketPrice30Days -> 25.8, NOrders30Days -> 3
Customer | 2         | 2020-02-01 | NOrders30Days -> 17
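The effect of the three updates in this example can be mimicked with an in-memory map, which may help when reasoning about the merge semantics. This is a plain-Scala stand-in, not the Delta `MERGE INTO` statement itself: `RowKey`, `FeatureUpdate` and `merge` are hypothetical names, and the sketch assumes a newer value simply overwrites an older one for the same feature, which the talk does not specify.

```scala
object FeatureMapMergeSketch {
  // Key of a materialized row: entity type, entity id and date.
  final case class RowKey(entity: String, entityId: Long, date: String)

  // One feature update command: "update feature F for entity E at row R with value V".
  final case class FeatureUpdate(key: RowKey, feature: String, value: Double)

  type Table = Map[RowKey, Map[String, Double]]

  // Upsert: insert the row when it is missing, otherwise add the new entry to the
  // existing features map (the role MERGE INTO and map_concat play in the talk).
  // Assumption: a later value for the same feature replaces the earlier one.
  def merge(table: Table, update: FeatureUpdate): Table = {
    val current = table.getOrElse(update.key, Map.empty[String, Double])
    table.updated(update.key, current + (update.feature -> update.value))
  }
}
```

Because the feature name is a map key rather than a column, adding a new feature needs no schema change in this model.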
iFood Feature Store Architecture
The materialization jobs
▪ Consumers are free to materialize the feature updates to their database of choice
▪ For ML, we use:
▪ A Delta table for historic feature values
▪ A Redis cluster for low-latency real-time access
[Diagram: Kafka Bus, Real-time Materialization Job, Real-time Redis Storage, Historic Materialization Job, Historic Delta Table Storage]
iFood Feature Store Architecture
The backfilling jobs
▪ How to calculate features for streaming data registered before the creation of a feature?
▪ Use a metadata database to store the creation time of each feature
▪ Run a backfilling job to create feature values up to the feature creation time
▪ Start the streaming job to emit results using values that arrive after the creation time
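The hand-off between the backfilling job and the streaming job comes down to splitting events at the feature's creation time. A minimal sketch follows, with a hypothetical `Event` type; the choice to send an event stamped exactly at the creation time to the streaming side is an assumption, since the talk does not say how the boundary is treated.

```scala
import java.time.Instant

object BackfillSplitSketch {
  // Hypothetical event with an identifier and an event timestamp.
  final case class Event(id: String, timestamp: Instant)

  // Events strictly before the feature's creation time go to the backfilling job;
  // events at or after it go to the streaming job. Returns (backfill, streaming).
  def split(events: Seq[Event], featureCreatedAt: Instant): (Seq[Event], Seq[Event]) =
    events.partition(_.timestamp.isBefore(featureCreatedAt))
}
```

Whatever boundary rule is chosen, each event should fall on exactly one side so that no value is counted twice or dropped.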
[Diagram: Kafka Bus, Data Lake Streaming Delta Table, Aggregation Jobs, DynamoDB Metadata, Historic Backfilling Jobs]
Lessons learned & Best practices
▪ Delta Tables double as streaming or batch sources
▪ OPTIMIZE is a must for streaming jobs saving to a Delta Table
▪ Either in auto mode, or as a separate process
▪ When starting a brand new job from a streaming Delta table source, the file reading order is not guaranteed
▪ This is even more noticeable after running OPTIMIZE (which you should!)
▪ If the event processing order is important for your job, either use Trigger.Once to process the first historical batch, or process each partition sequentially in order
Lessons learned & Best practices
▪ flatMapGroupsWithState is really powerful
▪ State management should be handled with care
▪ foreachBatch is really powerful
▪ Please note it can be triggered on an empty DataFrame, though
▪ Be sure to use correct partition pruning when using the MERGE INTO operation
▪ Be careful with parameter changes between job restarts
▪ StreamTest really helps with unit tests, debugging and raising the quality bar
Positive outcomes
▪ Unified codebase for historical and real-time features - 50% less code
▪ Unified jobs for historical and real-time features - from dozens of jobs to around 10
▪ Huge batch ETL jobs are replaced by much smaller streaming clusters
▪ Though they run 24/7
▪ Delta tables allow for isolation between read and write operations
Building a Real-Time Feature Store at iFood

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufVerverica
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...HostedbyConfluent
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...HostedbyConfluent
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies DataWorks Summit/Hadoop Summit
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
ETL Using Informatica Power Center
ETL Using Informatica Power CenterETL Using Informatica Power Center
ETL Using Informatica Power CenterEdureka!
 

Was ist angesagt? (20)

Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
 
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra...
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Oracle GoldenGate
Oracle GoldenGate Oracle GoldenGate
Oracle GoldenGate
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
ETL Using Informatica Power Center
ETL Using Informatica Power CenterETL Using Informatica Power Center
ETL Using Informatica Power Center
 

Ähnlich wie Building a Real-Time Feature Store at iFood

Couchbase@live person meetup july 22nd
Couchbase@live person meetup   july 22ndCouchbase@live person meetup   july 22nd
Couchbase@live person meetup july 22ndIdo Shilon
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at UberKarthik Murugesan
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData
 
FMK2019 being an optimist in a pessimistic world by vincenzo menanno
FMK2019 being an optimist in a pessimistic world by vincenzo menannoFMK2019 being an optimist in a pessimistic world by vincenzo menanno
FMK2019 being an optimist in a pessimistic world by vincenzo menannoVerein FM Konferenz
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase DataWorks Summit
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
Java Enterprise Performance - Unburdended Applications
Java Enterprise Performance - Unburdended ApplicationsJava Enterprise Performance - Unburdended Applications
Java Enterprise Performance - Unburdended ApplicationsLucas Jellema
 
Working with data using Azure Functions.pdf
Working with data using Azure Functions.pdfWorking with data using Azure Functions.pdf
Working with data using Azure Functions.pdfStephanie Locke
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
Aggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDB
Aggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDBAggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDB
Aggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDBScyllaDB
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSSmartNews, Inc.
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastDatabricks
 
Ultimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdf
Ultimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdfUltimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdf
Ultimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdfchanti29
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 

Ähnlich wie Building a Real-Time Feature Store at iFood (20)

Couchbase@live person meetup july 22nd
Couchbase@live person meetup   july 22ndCouchbase@live person meetup   july 22nd
Couchbase@live person meetup july 22nd
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
 
FMK2019 being an optimist in a pessimistic world by vincenzo menanno
FMK2019 being an optimist in a pessimistic world by vincenzo menannoFMK2019 being an optimist in a pessimistic world by vincenzo menanno
FMK2019 being an optimist in a pessimistic world by vincenzo menanno
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Java Enterprise Performance - Unburdended Applications
Java Enterprise Performance - Unburdended ApplicationsJava Enterprise Performance - Unburdended Applications
Java Enterprise Performance - Unburdended Applications
 
Working with data using Azure Functions.pdf
Working with data using Azure Functions.pdfWorking with data using Azure Functions.pdf
Working with data using Azure Functions.pdf
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Aggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDB
Aggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDBAggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDB
Aggregations at Scale for ShareChat —Using Kafka Streams and ScyllaDB
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
 
Ultimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdf
Ultimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdfUltimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdf
Ultimate+SnowPro+Core+Certification+Course+Slides+by+Tom+Bailey (1).pdf
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Building a Real-Time Feature Store at iFood

  • 2. Building a Realtime Feature Store at iFood Daniel Galinkin ML Platform Tech Lead
  • 3. Agenda iFood and AI What is the iFood mission, and how we use AI What is a Feature Store What is a Feature Store, and why it is important to solve AI problems How iFood built its Feature Store How iFood built its Feature Store by leveraging Spark, Databricks and Delta Tables
  • 5. BIGGEST FOODTECH IN LATIN AMERICA (we’re in Brazil, Mexico and Colombia) ~30 million orders per month +800 cities in all Brazilian states +100 thousand restaurants
  • 6. AI Everywhere Discovery: ▪ Restaurant recommendations ▪ Dish recommendations Logistics: ▪ Optimize driver allocation ▪ Estimate the delivery time ▪ Find the most efficient route Marketing: ▪ Optimize the use of marketing ads ▪ Optimize the use of coupons
  • 7. ML Platform Configuration Data Collection Feature Extraction ML Data Verification Analysis Tools Machine Resource Management Process Management Tools Serving Infrastructure Monitoring
  • 8. What is a Feature Store
  • 9. What are features? ▪ Any kind of data used to train an ML model ▪ Feature types: ▪ State features ▪ Did the user have a coupon at the time? ▪ Aggregate features ▪ Average ticket price in the last 30 days for the user ▪ External features ▪ Was it raining at the time?
  • 10. What is a feature store? ▪ The feature store is the central place in an organization to query for features ▪ Features are mostly used by machine learning algorithms ▪ They can also be useful for other applications ▪ For example, you could use the average ticket price for a user to show a high end or low end list of restaurants
  • 11. Feature store requirements ▪ General: ▪ Low latency ▪ Access & Calculation ▪ Access control ▪ Versioning ▪ Scalability ▪ Easy API for data access ▪ Machine Learning: ▪ Backfilling ▪ “Time-travel” - snapshot for historical feature values
  • 12. How iFood built its Feature Store
  • 13. iFood Software Architecture: streaming as a first-class citizen [Diagram components: Orders, Payments, Fleet location, Sessions, Coupons and Notifications microservices; Realtime events; Central Bus; Real-time Data Lake; Feature Store; Aggregation Service]
  • 14. iFood Real-time Data Lake Architecture ▪ Kafka storage is expensive ▪ Retention is limited ▪ Full event history enables recalculation and backfilling for features ▪ Delta tables provide a cheap storage option ▪ Delta tables can double as either batch or streaming sources [Diagram components: Realtime events, Central Bus, Data Lake Streaming Jobs, Data Lake Streaming Delta Table]
  • 15. iFood Feature Store Architecture [Diagram components: Kafka Bus, Data Lake Streaming Delta Table, Aggregation Jobs, Historic Backfilling Jobs, DynamoDB Metadata, Real-time Materialization Job, Real-time Redis Storage, Historic Materialization Job, Historic Delta Table Storage]
  • 16. iFood Feature Store Architecture [Same diagram, highlighting: The aggregation jobs]
  • 17. iFood Feature Store Architecture The aggregation jobs ▪ Features are usually combinations of: ▪ Source - orders stream ▪ Window range - last 30 days ▪ Grouping key - by each user ▪ Value - ticket price ▪ Filter - during lunch ▪ Aggregation type - average [Diagram components: Kafka Bus, Data Lake Streaming Delta Table, Aggregation Jobs]
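The six parts listed on this slide can be read as the fields of a small record. The following is a minimal sketch in plain Scala, assuming hypothetical type and field names (`FeatureDefinition`, `is_lunch`, and the naming scheme are illustrative and not taken from the talk; `user_id` and `ticket` mirror the column names used in the slides' code).

```scala
object FeatureSpecExample {
  // Hypothetical aggregation types; the slide only names "average".
  sealed trait AggregationType
  case object Average extends AggregationType
  case object Sum extends AggregationType
  case object Count extends AggregationType

  // One feature = one combination of the six parts listed on the slide.
  final case class FeatureDefinition(
      source: String,          // e.g. the orders stream
      windowRangeDays: Int,    // e.g. last 30 days
      groupingKey: String,     // e.g. one value per user
      valueColumn: String,     // e.g. ticket price
      filter: Option[String],  // e.g. only orders placed during lunch
      aggregation: AggregationType
  ) {
    // Illustrative naming scheme, not the one used by iFood.
    def name: String = s"${aggregation}_${valueColumn}_${windowRangeDays}d"
  }

  // "Average ticket price in the last 30 days for the user, during lunch".
  val avgLunchTicket30Days: FeatureDefinition = FeatureDefinition(
    source = "orders",
    windowRangeDays = 30,
    groupingKey = "user_id",
    valueColumn = "ticket",
    filter = Some("is_lunch"),
    aggregation = Average
  )
}
```

Two features that share a source and grouping key but differ only in window range are candidates for being computed by the same job, which is the point the following slides develop.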
  • 18. iFood Feature Store Architecture The aggregation jobs ▪ With Spark streaming, you can only execute one group by operation per dataframe/job ▪ Each combination of grouping key and window range results in a new dataframe ▪ That means increased costs and operational complexity
        ordersStreamDF
          .groupBy(col("user_id"), window(col("order_ts"), "1 day"))
          .agg(sum("ticket"))
        ordersStreamDF
          .groupBy(col("user_id"), window(col("order_ts"), "3 days"))
          .agg(sum("ticket"))
        ordersStreamDF
          .groupBy(col("user_id"), window(col("order_ts"), "7 days"))
          .agg(sum("ticket"))
  • 19. iFood Feature Store Architecture The aggregation jobs ▪ We store the intermediate state for several aggregation types for a fixed smaller window ▪ We then combine the results to emit the result for several window sizes at once ▪ This also allows us to use the same code and the same job to calculate historical and real-time features
  • 20. iFood Feature Store Architecture The aggregation jobs - Two-step aggregation logic
        Orders Streaming Source, values per 1-day window (intermediate state):
          D-6: 1 | D-5: 2 | D-4: 3 | D-3: 0 | D-2: 1 | D-1: 1 | D-0: 2
        3-day windows (combined from the 1-day windows):
          D-6 to D-4: 6 | D-5 to D-3: 5 | D-4 to D-2: 4 | D-3 to D-1: 2 | D-2 to D-0: 4
        7-day windows (combined from the 1-day windows):
          D-6 to D-0: 10
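The two-step logic on this slide can be sketched in plain Scala, without Spark: step one keeps one partial sum per 1-day bucket, and step two combines those buckets into whichever window sizes are needed. The object and method names are hypothetical; the bucket values are the ones shown on the slide.

```scala
object TwoStepAggregation {
  // Step 1: intermediate state, one partial sum per fixed 1-day bucket.
  // Index 0 is D-6 and index 6 is D-0, using the values from the slide.
  val dailyValues: Vector[Int] = Vector(1, 2, 3, 0, 1, 1, 2)

  // Step 2: combine the most recent `windowDays` buckets into one result.
  def combine(buckets: Vector[Int], windowDays: Int): Int =
    buckets.takeRight(windowDays).sum

  // Emit results for several window sizes at once from the same state.
  def allWindows(buckets: Vector[Int], sizes: Seq[Int]): Map[Int, Int] =
    sizes.map(size => size -> combine(buckets, size)).toMap

  // Every sliding window of a given size, oldest window first.
  def sliding(buckets: Vector[Int], windowDays: Int): Vector[Int] =
    buckets.sliding(windowDays).map(_.sum).toVector
}
```

Here `TwoStepAggregation.sliding(TwoStepAggregation.dailyValues, 3)` reproduces the 3-day row of the slide (6, 5, 4, 2, 4), and a window size of 7 gives the single 7-day value of 10. This works directly for sums and counts; an average would need the intermediate state to hold both a sum and a count per bucket.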
  • 21. iFood Feature Store Architecture The aggregation jobs ▪ How to express that? ▪ flatMapGroupsWithState ▪ Flexibility on storing state and expressing calculation logic ▪ That allows us to combine dozens of jobs into one
        // Pseudocode: the helper calls are placeholders for the actual logic.
        def combineAggregations(
            sourceDF: DataFrame,
            groupByKeys: Seq[String],
            windowStep: Long,
            combinationRules: Seq[CombinationRule]): DataFrame = {
          putStateAndOutputPlaceholdersToFitCombinedSchema(sourceDF)
            .groupByKey(row => combineGroupKeys())
            .flatMapGroupsWithState((state, miniBatchIterator) => {
              miniBatchIterator.foreach(row => {
                // Slide the state range when a newer window arrives.
                if (inputWindowEnd() > newestOutputWindowEnd()) {
                  moveStateRangeForward()
                }
                // First step: update the per-window intermediate value.
                if (inputRowIsInStateRange()) {
                  firstStepUpdateIntermediateValue()
                }
              })
              // Second step: derive every configured window size from the state.
              combinationRules.foreach(combinationRule => {
                secondStepCalculateFinalResultBasedOnIntermediateValues()
              })
              yieldAnOutputRowBasedOnTheResults()
            })
        }
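The per-key state handling named in the pseudocode (move the state range forward, update the intermediate value, then combine) can be illustrated with a minimal sketch in plain Scala. It deliberately does not use Spark's `flatMapGroupsWithState`; `WindowState`, `update` and `results` are hypothetical names, buckets are identified by an integer index such as a day number, and only sums are supported.

```scala
object SlidingState {
  // State for one grouping key: partial sums for the last `sums.length`
  // buckets, oldest first; the last element belongs to `newestBucket`.
  final case class WindowState(newestBucket: Long, sums: Vector[Long])

  def empty(size: Int, newestBucket: Long): WindowState =
    WindowState(newestBucket, Vector.fill(size)(0L))

  // First step: fold one input event into the intermediate state.
  def update(state: WindowState, bucket: Long, value: Long): WindowState = {
    val size = state.sums.length
    // Move the state range forward when the event is newer than the state.
    val moved =
      if (bucket > state.newestBucket) {
        val shift = math.min(bucket - state.newestBucket, size.toLong).toInt
        WindowState(bucket, state.sums.drop(shift) ++ Vector.fill(shift)(0L))
      } else state
    // Position of the event's bucket inside the state range.
    val index = (size - 1) - (moved.newestBucket - bucket)
    if (index < 0) moved // older than the state range: the event is ignored
    else {
      val i = index.toInt
      moved.copy(sums = moved.sums.updated(i, moved.sums(i) + value))
    }
  }

  // Second step: derive every requested window size from the same state.
  def results(state: WindowState, windowSizes: Seq[Int]): Map[Int, Long] =
    windowSizes.map(n => n -> state.sums.takeRight(n).sum).toMap
}
```

Feeding the slide's daily values through `update` one bucket at a time and then calling `results(state, Seq(1, 3, 7))` yields 2, 4 and 10, so a single state per key serves all three window sizes. Dropping events that fall before the state range is an assumption of this sketch, not a behavior stated in the talk.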
  • 22. iFood Feature Store Architecture The aggregation jobs
        Input (orders), columns: Order ID | Customer ID | Date | ...
          example row: Customer 1 | 2020-01-01 | ...
        Output (one row per feature):
          Entity   | Entity ID | Date       | Feat. Name   | Feat. Value
          Customer | 1         | 2020-01-01 | NOrders1Day  | 2
          Customer | 1         | 2020-01-01 | NOrders3Days | 6
          Customer | 1         | 2020-01-01 | NOrders7Days | 10
  • 23. iFood Feature Store Architecture [Same diagram as slide 15, highlighting: The materialization jobs]
  • 24. iFood Feature Store Architecture The materialization jobs ▪ Feature update commands are stored to a Kafka topic - think CDC or log tailing ▪ Update feature F for entity E at row R with value V ▪ Using the Delta Table Storage, we use MERGE INTO and the map_concat function to be flexible
        Update 1 (Entity | Entity ID | Date | Feat. Name | Feat. Value):
          Customer | 1 | 2020-01-01 | AvgTicketPrice30Days | 25.8
        Table after update 1 (Entity | Entity ID | Date | Features Map):
          Customer | 1 | 2020-01-01 | AvgTicketPrice30Days -> 25.8
        Update 2:
          Customer | 2 | 2020-02-01 | NOrders30Days | 17
        Table after update 2:
          Customer | 1 | 2020-01-01 | AvgTicketPrice30Days -> 25.8
          Customer | 2 | 2020-02-01 | NOrders30Days -> 17
        Update 3:
          Customer | 1 | 2020-01-01 | NOrders30Days | 3
        Table after update 3:
          Customer | 1 | 2020-01-01 | AvgTicketPrice30Days -> 25.8, NOrders30Days -> 3
          Customer | 2 | 2020-02-01 | NOrders30Days -> 17
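The effect of applying feature update commands to a table with a features map can be emulated in memory. The sketch below is plain Scala and does not call Delta's MERGE INTO or Spark's map_concat; it only mirrors the row-level outcome shown on the slide. All type and method names are hypothetical, and overwriting an existing feature with the newest value is an assumption of this sketch.

```scala
object FeatureMaterialization {
  // The row a feature value belongs to: entity type, entity id and date.
  final case class RowKey(entity: String, entityId: String, date: String)

  // "Update feature F for entity E at row R with value V".
  final case class FeatureUpdate(key: RowKey, featureName: String, value: Double)

  // One features map per row, as in the "Features Map" column on the slide.
  type FeatureTable = Map[RowKey, Map[String, Double]]

  // Insert the row if it is new, otherwise merge the feature into its map.
  def applyUpdate(table: FeatureTable, update: FeatureUpdate): FeatureTable = {
    val existing = table.getOrElse(update.key, Map.empty[String, Double])
    table.updated(update.key, existing + (update.featureName -> update.value))
  }

  // Replay a sequence of update commands in order, starting from an empty table.
  def materialize(updates: Seq[FeatureUpdate]): FeatureTable =
    updates.foldLeft(Map.empty[RowKey, Map[String, Double]])(applyUpdate)
}
```

Replaying the three updates from the slide leaves customer 1 with both `AvgTicketPrice30Days` and `NOrders30Days` in one map and customer 2 with `NOrders30Days` only. Because the row schema never changes when a new feature name appears, new features can be added without altering the table.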
  • 25. iFood Feature Store Architecture The materialization jobs ▪ Consumers are free to materialize them to their database of choice ▪ For ML, we use: ▪ A Delta table for historic feature values ▪ A Redis cluster for low latency real-time access [Diagram components: Kafka Bus, Real-time Materialization Job, Real-time Redis Storage, Historic Materialization Job, Historic Delta Table Storage]
  • 26. iFood Feature Store Architecture [Same diagram as slide 15, highlighting: The backfilling jobs]
  • 27. iFood Feature Store Architecture The backfilling jobs ▪ How to calculate features for streaming data registered before the creation of a feature? ▪ Use a metadata database to store the creation time of each feature ▪ Run a backfilling job to create feature values up to the feature creation ▪ Start the streaming job to emit results using values that arrive after the creation date [Diagram components: Kafka Bus, Data Lake Streaming Delta Table, Aggregation Jobs, DynamoDB Metadata, Historic Backfilling Jobs]
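The split between backfilling and streaming can be sketched as a partition of events around the stored feature creation time. This is a minimal illustration in plain Scala with hypothetical names; treating events exactly at the creation timestamp as streaming input is an assumption, since the slide does not say which side the boundary belongs to.

```scala
object BackfillSplit {
  // A source event with its event time (e.g. epoch milliseconds).
  final case class Event(timestamp: Long, value: Double)

  // Events strictly before the creation time go to the backfilling job;
  // events at or after it are left to the streaming job (assumed boundary).
  def split(events: Seq[Event], featureCreatedAt: Long): (Seq[Event], Seq[Event]) =
    events.partition(_.timestamp < featureCreatedAt)
}
```

With a creation time of 5, events at timestamps 1, 5 and 10 split into one backfill event (1) and two streaming events (5 and 10). Using a single cut-off read from the metadata store means no event is processed by both jobs or by neither.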
  • 28. Lessons learned & Best practices ▪ Delta Tables double as streaming or batch sources ▪ OPTIMIZE is a must for streaming jobs saving to a Delta Table ▪ Either on auto mode, or as a separate process ▪ When starting a brand new job from a streaming Delta table source, the file reading order is not guaranteed ▪ This is even more noticeable after running OPTIMIZE (which you should!) ▪ If the event processing order is important for your job, either use Trigger.Once to process the first historical batch, or process each partition sequentially in order
  • 29. Lessons learned & Best practices ▪ flatMapGroupsWithState is really powerful ▪ State management should be handled with care ▪ foreachBatch is really powerful ▪ Please note it can be triggered on an empty DataFrame, though ▪ Be sure to use correct partition pruning when using the MERGE INTO operation ▪ Be careful with parameter changes between job restarts ▪ StreamTest really helps with unit tests, debugging and raising the bar
  • 30. Positive outcomes ▪ Unified codebase for historical and real-time features - 50% less code ▪ Unified jobs for historical and real-time features - from dozens of jobs to around 10 ▪ Huge batch ETL jobs are replaced by much smaller streaming clusters ▪ Though they run 24/7 ▪ Delta tables allow for isolation between read and write operations
  • 31. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.