Michelangelo Palette
Feature Engineering @ Uber
Amit Nene
Staff Engineer, Michelangelo ML Platform
Eric Chen
Engineering Manager, Michelangelo ML Platform
Enable engineers and data scientists across the
company to easily build and deploy machine learning
solutions at scale.
ML-as-a-service
○ Managing Data/Features
○ Tools for managing end-to-end, heterogeneous training workflows
○ Batch, online & mobile serving
○ Feature and Model drift monitoring
Michelangelo @ Uber
MANAGE DATA
TRAIN MODELS
EVALUATE MODELS
DEPLOY MODELS
MAKE PREDICTIONS
MONITOR PREDICTIONS
Feature Engineering @ Uber
○ Example: ETA for EATS order
○ Key ML features
○ How large is the order?
○ How busy is the restaurant?
○ How quick is the restaurant?
○ How busy is the traffic?
Managing Features
One of the hardest problems in ML
○ Finding good Features & labels
○ Data in production: reliability, scale, low latency
○ Data parity: training/serving skew
○ Real-time features: traditional tools don’t work
Palette Feature Store
Uber-specific curated and crowd-sourced feature database that is easy to use with machine
learning projects.
One stop shop
○ Search for features in single catalog/spec: rider, driver, restaurant, trip, eaters, etc.
○ Define new features + create production pipelines from spec
○ Share features across Uber: cut redundancy, use consistent data
○ Enable tooling: Data Drift Detection, Auto Feature Selection, etc.
Feature Store Organization
Organized as <entity>:<feature-group>:<feature-name>:<join-key>
E.g. @palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid
Backed by a dual datastore system, similar to a lambda architecture
○ Offline
○ Offline (Hive based) store for bulk access of features
○ Bulk retrieval of features across time
○ Online
○ KV store (Cassandra) for serving latest known value
○ Supports lookup/join of latest feature values in real time
○ Data synced between online & offline
○ Key to avoiding training/serving skew
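The naming scheme and the dual store can be illustrated with a minimal Python sketch. Everything here is hypothetical and not Palette's API: `FeatureRef`, `parse_feature_ref`, and `DualStore` are made-up names, and in-memory structures stand in for Hive and Cassandra. The point is only that every write lands in both stores, so training (history) and serving (latest value) read the same data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRef:
    """Parsed form of '@palette:<entity>:<feature-group>:<feature-name>:<join-key>'."""
    entity: str
    group: str
    name: str
    join_key: str


def parse_feature_ref(ref: str) -> FeatureRef:
    # Hypothetical helper: split the reference and validate its shape.
    prefix, *parts = ref.split(":")
    if prefix != "@palette" or len(parts) != 4 or not all(parts):
        raise ValueError(f"not a Palette feature reference: {ref!r}")
    return FeatureRef(*parts)


class DualStore:
    """Toy stand-in for the offline (history) and online (latest value) stores."""

    def __init__(self):
        self.offline = []  # append-only history of (key, timestamp, value)
        self.online = {}   # key -> (timestamp, value), latest known value only

    def write(self, key, timestamp, value):
        # Every write goes to both stores, which keeps them in sync.
        self.offline.append((key, timestamp, value))
        current = self.online.get(key)
        if current is None or timestamp >= current[0]:
            self.online[key] = (timestamp, value)

    def latest(self, key):
        entry = self.online.get(key)
        return None if entry is None else entry[1]


ref = parse_feature_ref(
    "@palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid"
)
store = DualStore()
store.write("uuid1", timestamp=1, value=5)
store.write("uuid1", timestamp=3, value=9)
store.write("uuid1", timestamp=2, value=7)  # late arrival: kept in history only
```

After these writes, the online side serves the value written at timestamp 3, while the offline side retains all three rows for retrieval across time.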
EATS Features revisited
○ How large is the order? ← Input
○ How busy is the restaurant?
○ How quick is the restaurant?
○ How busy is the traffic?
Creating Batch Features
[Diagram: Palette Feature spec, Offline Batch jobs, Feature Store (Offline Store for Training, Online Store for Serving), Data dispersal, Features join + Model Training job, Features join + Model Scoring Service]
General trends, not sensitive to
exact time of event
Ingested from Hive queries or
Spark jobs
How quick is the restaurant?
○ Aggregate trends
○ Use Hive QL from warehouse
○ @palette:restaurant:batch_aggr:prepTime:rId
Hive QL
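As a rough illustration of what such a batch aggregate computes, here is a plain-Python sketch of an average preparation time per restaurant. It only mirrors the shape of a GROUP BY aggregation; the function name `average_prep_time` and the sample rows are assumptions for illustration, not the actual Hive query behind the feature.

```python
from collections import defaultdict


def average_prep_time(orders):
    """Average prep time per restaurant from (restaurant_id, prep_minutes) rows."""
    totals = defaultdict(lambda: [0.0, 0])  # restaurant_id -> [sum, count]
    for restaurant_id, prep_minutes in orders:
        acc = totals[restaurant_id]
        acc[0] += prep_minutes
        acc[1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


# Hypothetical historical orders pulled from the warehouse.
orders = [("uuid1", 18), ("uuid1", 22), ("uuid2", 15)]
prep_time = average_prep_time(orders)
```

Because the result is a general trend, it can be recomputed on a schedule and dispersed to the online store without tracking the exact time of each event.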
Apache Hive and Apache Spark are either registered trademarks or
trademarks of the Apache Software Foundation in the United
States and/or other countries. No endorsement by The Apache
Software Foundation is implied by the use of this mark.
Creating Real-time Features
[Diagram: Flink-as-service, Streaming jobs, Feature Store (Online Store for Serving, Offline Store for Training), Log + Backfill, Features join + Model Scoring Service, Features join + Model Training job]
Features reflecting the latest
state of the world
Ingest from streaming jobs
How busy is the restaurant?
○ Kafka topic with events
○ Perform real-time aggregations
○ @palette:restaurant:rt_aggr:nMeal:rId
Palette Feature spec
Apache Flink is either a registered trademark or trademark of the
Apache Software Foundation in the United States and/or other
countries. No endorsement by The Apache Software Foundation is
implied by the use of this mark.
Flink SQL
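A real-time aggregation such as "orders in the last 30 minutes" can be sketched as a sliding-window counter. This is a simplified in-memory stand-in, not Flink: `SlidingWindowCounter` is a hypothetical name, timestamps are plain seconds, and events are assumed to arrive in time order per key.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the trailing window (assumes in-order events)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # key -> event timestamps, oldest first

    def add(self, key, event_time):
        self.events[key].append(event_time)

    def count(self, key, now):
        queue = self.events[key]
        # Evict events that have fallen out of the window (now - window, now].
        while queue and queue[0] <= now - self.window_seconds:
            queue.popleft()
        return len(queue)


counter = SlidingWindowCounter(window_seconds=30 * 60)
for event_time in (0, 600, 1700):  # hypothetical order events for one restaurant
    counter.add("uuid1", event_time)
n_meal = counter.count("uuid1", now=1800)
```

A streaming job would write each updated count to the online store and log it, so the same values can be backfilled into the offline store for training.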
Bring Your Own Features
Features maintained by customers
Mechanisms for hooking in serving/training endpoints
Users maintain data parity
How busy is the region?
○ RPC: external traffic feed
○ Log RPCs for training
○ @palette:region:traffic:nBusy:regionId
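One way to picture "Log RPCs for training" is a thin proxy that records every value it serves. The sketch below is an assumption-laden illustration: `LoggingFeatureProxy` is a hypothetical class, a dictionary lookup stands in for the external traffic RPC, and the log is an in-memory list rather than a real offline store.

```python
import time


class LoggingFeatureProxy:
    """Wraps an external feature lookup and logs each response for later training."""

    def __init__(self, fetch, clock=time.time):
        self.fetch = fetch  # callable standing in for the customer's RPC endpoint
        self.clock = clock  # injectable clock, which makes the proxy easy to test
        self.log = []       # rows that a training job could read back offline

    def get(self, key):
        value = self.fetch(key)
        self.log.append({"key": key, "timestamp": self.clock(), "value": value})
        return value


# Hypothetical external traffic feed keyed by region id.
traffic_feed = {"region_1": 3}
proxy = LoggingFeatureProxy(fetch=traffic_feed.get)
n_busy = proxy.get("region_1")
```

Training on the logged rows means the model sees exactly the values that were served, which is how parity can be maintained for a feature the platform does not compute itself.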
[Diagram: Palette Feature spec, Offline Proxy, Online Proxy, Custom store, Batch API, Service endpoint, RPC, Features join + Model Scoring Service, Features join + Model Training job]
Palette Feature Joins
Join @basis features with supplied @palette features into a single feature vector
Join billion+ rows at points-in-time: dominates overhead
Join/lookup 10s of tables at serving time at low latency
@basis features
order_id  nOrder  restaurant_uuid  latlong  Label ETA (training)  timestamp
1         4       uuid1            (10,20)  40m                   t1
2         3       uuid2            (30,40)  35m                   t2

@palette:restaurant:agg_stats:prepTime:restaurant_uuid (join_key = rId)
rId    prepTime  timestamp
uuid1  20m       t1
uuid2  15m       t2

Time + Key join

training/scoring feature vector
order_id  rId    latlong  Label ETA (training)  prepTime  ..
1         uuid1  (10,20)  40m                   20m       ..
2         uuid2  (30,40)  35m                   15m       ..

Also shown: @palette:restaurant:realtime_stats:nBusy:rId, @palette:region:stats:nBusy:regionId
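The "Time + Key join" can be sketched as an as-of join: for each basis row, pick the most recent feature value for the same key whose timestamp is not later than the row's timestamp. The code below is a small in-memory illustration under stated assumptions (integer timestamps stand in for t1 and t2, and `point_in_time_join` is a hypothetical helper); the production join runs over billions of rows and is not shown in the slides.

```python
from bisect import bisect_right


def point_in_time_join(basis_rows, feature_rows, join_key, feature_name):
    """Attach to each basis row the latest feature value at or before its timestamp."""
    # Build a per-key history sorted by timestamp.
    history = {}
    for row in sorted(feature_rows, key=lambda r: r["timestamp"]):
        times, values = history.setdefault(row[join_key], ([], []))
        times.append(row["timestamp"])
        values.append(row[feature_name])

    joined = []
    for row in basis_rows:
        times, values = history.get(row[join_key], ([], []))
        idx = bisect_right(times, row["timestamp"])  # number of values at or before
        out = dict(row)
        out[feature_name] = values[idx - 1] if idx else None
        joined.append(out)
    return joined


basis = [
    {"order_id": 1, "nOrder": 4, "rId": "uuid1", "timestamp": 1},
    {"order_id": 2, "nOrder": 3, "rId": "uuid2", "timestamp": 2},
]
prep_times = [
    {"rId": "uuid1", "prepTime": 20, "timestamp": 1},
    {"rId": "uuid2", "prepTime": 15, "timestamp": 2},
    {"rId": "uuid1", "prepTime": 25, "timestamp": 5},  # later value, must not leak
]
vectors = point_in_time_join(basis, prep_times, "rId", "prepTime")
```

Order 1 receives the prep time of 20 that was known at its timestamp rather than the later value of 25, which is what keeps training data free of information from the future.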
Done with Feature Engineering?
○ Feature Store Features
○ nOrder: How large is the order? (basis)
○ nMeal: How busy is the restaurant? (near real-time)
○ prepTime: How quick is the restaurant? (batch feature)
○ nBusy: How busy is the traffic? (external feature)
○ Ready to use?
○ Model specific feature transformations
○ Chaining of features
○ Feature Transformers
Feature Consumption
Feature Store Features
○ nOrder: input feature
○ nMeal: consume directly
○ prepTime: needs transformation before use
○ nBusy: input latlong but need regionId
Setting up consumption pipelines
○ nMeal: r_id -> nMeal
○ prepTime: r_id -> prepTime -> featureImpute
○ nBusy: r_id -> lat, log -> regionId(lat, log) -> nBusy
In arbitrary order
Michelangelo Transformers
Transformer: Given a record defined by a set of fields, add/modify/remove fields in the record
PipelineModel: A sequence of transformers
Spark ML: Transformer/PipelineModel on DataFrames
Michelangelo Transformers: extended transformer framework for both Apache Spark and
Apache Spark-less environments
Estimator: Analyze the data and produce a transformer
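The three roles defined above can be sketched in a few lines of Python operating on plain dictionaries. This is an illustrative toy, not Spark ML and not the Michelangelo framework: the class names mirror the concepts, while `FillMissing` and `MeanImputer` are hypothetical examples of a transformer and the estimator that produces it.

```python
class Transformer:
    """Given a record, add/modify/remove fields and return the new record."""

    def transform(self, record: dict) -> dict:
        raise NotImplementedError


class Estimator:
    """Analyzes data and produces a Transformer."""

    def fit(self, records: list) -> Transformer:
        raise NotImplementedError


class FillMissing(Transformer):
    def __init__(self, field, default):
        self.field = field
        self.default = default

    def transform(self, record):
        out = dict(record)  # never mutate the caller's record
        if out.get(self.field) is None:
            out[self.field] = self.default
        return out


class MeanImputer(Estimator):
    def __init__(self, field):
        self.field = field

    def fit(self, records):
        values = [r[self.field] for r in records if r.get(self.field) is not None]
        if not values:
            raise ValueError(f"no values to fit for field {self.field!r}")
        return FillMissing(self.field, sum(values) / len(values))


class PipelineModel(Transformer):
    """A sequence of transformers applied in order."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, record):
        for stage in self.stages:
            record = stage.transform(record)
        return record


training = [{"prepTime": 20}, {"prepTime": 10}, {"prepTime": None}]
model = PipelineModel([MeanImputer("prepTime").fit(training)])
scored = model.transform({"rId": "uuid1", "prepTime": None})
```

The fitted pipeline carries the learned mean with it, so the same imputation is applied at training and at serving time.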
Defining a Pipeline Model
Join Palette Features
Apply Feature Eng Rules
String Indexing
One-Hot Encoding
DL Inferencing
Result Retrieval
Feature consumption
○ Feature extraction: Palette feature retrieval expressed as a transform
○ Feature mutation: Scala-like DSL for simple transforms
○ Model-centric Feature Engineering: string indexer, one-hot encoder,
threshold decision
○ Result retrieval
Modeling
○ Model inferencing (also Michelangelo Transformer)
Michelangelo Transformers Example
// Model: a Spark ML Model that also mixes in Michelangelo's MATransformer
class MyModel(override val uid: String) extends Model[MyModel]
    with MyModelParam with MLWritable with MATransformer {
  ...
  // Spark path: transform a whole Dataset
  override def transform(dataset: Dataset[_]): DataFrame = ...
  // Spark-less path: score a single record
  override def scoreInstance(instance: util.Map[String, Object]): util.Map[String, Object] = ...
}

// Estimator: analyzes the data and produces a MyModel
class MyEstimator(override val uid: String) extends Estimator[MyModel]
    with Params with DefaultParamsWritable {
  ...
  override def fit(dataset: Dataset[_]): MyModel = ...
}
Palette retrieval as a Transformer
Palette Feature Transformer
Feature Meta Store
RPC Feature Proxy
Cassandra Access
Hive Access
tx_p1 = PaletteTransformer([
"@palette:restaurant:realtime_feature:nMeal:r_id",
"@palette:restaurant:batch_feature:prepTime:r_id",
"@palette:restaurant:property:lat:r_id",
"@palette:restaurant:property:log:r_id"
])
tx_p2 = PaletteTransformer([
"@palette:region:service_feature:nBusy:region_id"
])
DSL Estimator / Transformer
DSL Estimator
Code Gen / Compiler
DSL Transformer
Online classloader Offline classloader
es_dsl1 = DSLEstimator(lambdas=[
    ["region_id", "regionId(@palette:restaurant:property:lat:r_id, @palette:restaurant:property:log:r_id)"]
])
es_dsl2 = DSLEstimator(lambdas=[
    ["prepTime", 'nFill(nVal("@palette:restaurant:batch_feature:prepTime:r_id"), avg("@palette:restaurant:batch_feature:prepTime:r_id"))'],
    ["nMeal", 'nVal("@palette:restaurant:realtime_feature:nMeal:r_id")'],
    ["nOrder", 'nVal("@basis:nOrder")'],
    ["nBusy", 'nVal("@palette:region:service_feature:nBusy:region_id")']
])
Uber Eats Example Cont.
Computation order
○ nMeal: rId -> nMeal
○ prepTime: rId -> prepTime -> featureImpute
○ nBusy: rId -> lat, log -> regionId(lat, log) -> nBusy
Palette Transformer
id -> nMeal
id -> prepTime
id -> lat, log
DSL Transformer
lat, log -> regionID
Palette Transformer
regionID -> nBusy
DSL Transformer
impute(nMeal)
impute(prepTime)
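The four stages above can be traced end to end with a small Python sketch. All data and helpers are hypothetical: dictionaries stand in for the feature stores, `dsl_region` uses a made-up grid rule in place of the real regionId function, and the default prep time is an arbitrary constant rather than a fitted average.

```python
# Toy stand-ins for the online feature stores (illustrative values only).
RESTAURANT_FEATURES = {"uuid1": {"nMeal": 7, "prepTime": None, "lat": 10, "log": 20}}
REGION_FEATURES = {"region_1": {"nBusy": 3}}
DEFAULT_PREP_TIME = 15  # assumption: would normally be learned by an estimator


def palette_restaurant(record):
    # Palette Transformer: rId -> nMeal, prepTime, lat, log
    return {**record, **RESTAURANT_FEATURES[record["rId"]]}


def dsl_region(record):
    # DSL Transformer: lat, log -> region_id (hypothetical 10-degree latitude band)
    return {**record, "region_id": f"region_{record['lat'] // 10}"}


def palette_region(record):
    # Palette Transformer: region_id -> nBusy
    return {**record, "nBusy": REGION_FEATURES[record["region_id"]]["nBusy"]}


def dsl_impute(record):
    # DSL Transformer: fill a missing prepTime
    out = dict(record)
    if out.get("prepTime") is None:
        out["prepTime"] = DEFAULT_PREP_TIME
    return out


def run_pipeline(record, stages):
    for stage in stages:
        record = stage(record)
    return record


stages = [palette_restaurant, dsl_region, palette_region, dsl_impute]
features = run_pipeline({"rId": "uuid1", "nOrder": 4}, stages)
```

The ordering matters: the second Palette lookup cannot run until the DSL stage has derived `region_id`, which is why retrieval and transformation stages are interleaved rather than grouped.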
Dev Tools: Authoring and Debugging a Pipeline
Palette feature generation
● Apache Hive QL, Apache Flink SQL
Interactive authoring
● PySpark + IPython/Jupyter notebook
Centralized model store
● Serialization / Deserialization (Spark ML,
MLReadable/Writeable)
● Online and offline accessibility
basis_feature_sql = "..."
df = spark.sql(basis_feature_sql)
pipeline = Pipeline(stages=[tx_p1, es_dsl1, tx_p2, es_dsl2, vec_asm, l_r])
pipeline_model = pipeline.fit(df)
scored_df = pipeline_model.transform(df)
model_id = MA_store.save_model(pipeline_model)
draft_id = MA_store.save_pipeline(basis_feature_sql, pipeline)
retrain_job = MA_API.train(draft_id, new_basis_feature_sql)
Takeaways
Feature Store: Batch, Realtime and External Features with online and offline parity
Offline scalability: Joins across billions of rows
Online serving latency: Parallel IO, fast storage with caching
Feature Transformers: Setup chains of transformations at training/serving time
Pipeline reliability and monitoring out-of-the-box
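The "Parallel IO, fast storage with caching" takeaway can be sketched with Python's standard library. This is only an illustration of the idea, with hypothetical names and data: `lookup` stands in for a key-value store read, a thread pool issues the per-table reads concurrently, and `functools.lru_cache` plays the role of the cache.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Toy feature tables standing in for the online store (illustrative values only).
TABLES = {"realtime": {"uuid1": 7}, "batch": {"uuid1": 20}}
store_reads = []  # records every read that actually reaches the "store"


@lru_cache(maxsize=1024)
def lookup(table, key):
    store_reads.append((table, key))
    return TABLES[table].get(key)


def fetch_features(key, tables):
    """Look up one key in several tables concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        values = pool.map(lambda table: lookup(table, key), tables)
        return dict(zip(tables, values))


first = fetch_features("uuid1", ["realtime", "batch"])
second = fetch_features("uuid1", ["realtime", "batch"])  # served from the cache
```

A real serving path would also need cache expiry so that real-time features do not go stale, which this sketch leaves out.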
Thank you
https://eng.uber.com/michelangelo/
https://www.uber.com/careers/

2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber

  • 1. Michelangelo Palette Feature Engineering @ Uber Amit Nene Staff Engineer, Michelangelo ML Platform Eric Chen Engineering Manager, Michelangelo ML Platform
  • 2. Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale. ML-as-a-service ○ Managing Data/Features ○ Tools for managing, end-to-end, heterogenous training workflows ○ Batch, online & mobile serving ○ Feature and Model drift monitoring Michelangelo @ Uber MANAGE DATA TRAIN MODELS EVALUATE MODELS DEPLOY MODELS MAKE PREDICTIONS MONITOR PREDICTIONS
  • 3. Feature Engineering @ Uber ○ Example: ETA for EATS order ○ Key ML features ○ How large is the order? ○ How busy is the restaurant? ○ How quick is the restaurant? ○ How busy is the traffic?
  • 4. Managing Features One of the hardest problems in ML ○ Finding good Features & labels ○ Data in production: reliability, scale, low latency ○ Data parity: training/serving skew ○ Real-time features: traditional tools don’t work
  • 5. Palette Feature Store Uber-specific curated and crowd-sourced feature database that is easy to use with machine learning projects. One stop shop ○ Search for features in single catalog/spec: rider, driver, restaurant, trip, eaters, etc. ○ Define new features + create production pipelines from spec ○ Share features across Uber: cut redundancy, use consistent data ○ Enable tooling: Data Drift Detection, Auto Feature Selection, etc.
  • 6. Feature Store Organization Organized as <entity>:<feature-group>:<feature-name>:<join-key> Eg. @palette:restaurant:realtime_group:orders_last_30min:restaurant_uuid Backed by a dual datastore system: similarities to lambda ○ Offline ○ Offline (Hive based) store for bulk access of features ○ Bulk retrieval of features across time ○ Online ○ KV store (Cassandra) for serving latest known value ○ Supports lookup/join of latest feature values in real time ○ Data synced between online & offline ○ Key to avoiding training/serving skew
  • 7. EATS Features revisited ○ How large is the order? ← Input ○ How busy is the restaurant? ○ How quick is the restaurant? ○ How busy is the traffic?
  • 8. Creating Batch Features General trends, not sensitive to exact time of event. Ingested from Hive queries or Spark jobs. Example: How quick is the restaurant? ○ Aggregate trends ○ Use Hive QL from warehouse ○ @palette:restaurant:batch_aggr:prepTime:rId [Diagram labels: Palette feature spec; Hive QL; offline batch jobs; Feature Store with Offline Store (Training) and Online Store (Serving); data dispersal; features join; model training job; model scoring service] Apache Hive and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
  • 9. Creating Real-time Features Features reflecting the latest state of the world. Ingested from streaming jobs. Example: How busy is the restaurant? ○ Kafka topic with events ○ Perform real-time aggregations ○ @palette:restaurant:rt_aggr:nMeal:rId [Diagram labels: Palette feature spec; Flink SQL; Flink-as-service streaming jobs; Feature Store with Online Store (Serving) and Offline Store (Training); Log + Backfill; features join; model scoring service; model training job] Apache Flink is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
  • 10. Bring Your Own Features Features maintained by customers. Mechanisms for hooking serving/training endpoints. Users maintain data parity. Example: How busy is the region? ○ RPC: external traffic feed ○ Log RPCs for training ○ @palette:region:traffic:nBusy:regionId [Diagram labels: Palette feature spec; Online Proxy; Offline Proxy; service endpoint; RPC; custom store; Batch API; features join; model scoring service; model training job]
  • 11. Palette Feature Joins Join @basis features with supplied @palette features into a single feature vector ○ Join billion+ rows at points-in-time: dominates overhead ○ Join/lookup 10s of tables at serving time at low latency
      @basis features:
        order_id  nOrder  restaurant_uuid  latlong  Label ETA (training)  timestamp
        1         4       uuid1            (10,20)  40m                   t1
        2         3       uuid2            (30,40)  35m                   t2
      @palette:restaurant:agg_stats:prepTime:restaurant_uuid (join_key = rId):
        rId    prepTime  timestamp
        uuid1  20m       t1
        uuid2  15m       t2
      Training/scoring feature vector after the time + key join:
        order_id  rId    latlong  Label ETA (training)  prepTime  ..
        1         uuid1  (10,20)  40m                   20m       ..
        2         uuid2  (30,40)  35m                   15m       ..
      Other feature references shown on the slide: @palette:restaurant:realtime_stats:nBusy:rId, @palette:region:stats:nBusy:regionId
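The time + key join can be sketched in plain Python as an as-of lookup: for each basis row, take the most recent feature value for the same key whose timestamp is not later than the row's. This is an illustration only, not Palette's join implementation; `point_in_time_join` is a hypothetical name, timestamps are integers, and both tables use `rId` as the key for simplicity.

```python
from bisect import bisect_right


def point_in_time_join(basis_rows, feature_rows, key, feature):
    """Attach to each basis row the latest feature value at or before its timestamp."""
    # Build a per-key history sorted by timestamp.
    history = {}
    for row in sorted(feature_rows, key=lambda r: r["timestamp"]):
        history.setdefault(row[key], []).append((row["timestamp"], row[feature]))

    joined = []
    for row in basis_rows:
        series = history.get(row[key], [])
        times = [t for t, _ in series]
        # Index of the first feature timestamp strictly after the basis timestamp.
        i = bisect_right(times, row["timestamp"])
        value = series[i - 1][1] if i > 0 else None  # None when no value existed yet
        joined.append({**row, feature: value})
    return joined


basis = [
    {"order_id": 1, "nOrder": 4, "rId": "uuid1", "timestamp": 100},
    {"order_id": 2, "nOrder": 3, "rId": "uuid2", "timestamp": 200},
]
prep_time = [
    {"rId": "uuid1", "prepTime": 20, "timestamp": 90},
    {"rId": "uuid1", "prepTime": 25, "timestamp": 150},  # later than order 1: must be ignored
    {"rId": "uuid2", "prepTime": 15, "timestamp": 200},
]

vectors = point_in_time_join(basis, prep_time, key="rId", feature="prepTime")
for v in vectors:
    print(v["order_id"], v["rId"], v["prepTime"])
```

Ignoring feature values newer than the basis row is what keeps training data from leaking information that would not have been available at serving time.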
  • 12. Done with Feature Engineering? ○ Feature Store Features ○ nOrder: How large is the order? (basis) ○ nMeal: How busy is the restaurant? (near real-time) ○ prepTime: How quick is the restaurant? (batch feature) ○ nBusy: How busy is the traffic? (external feature) ○ Ready to use? ○ Model-specific feature transformations ○ Chaining of features ○ Feature Transformers
  • 13. Feature Consumption Feature Store Features ○ nOrder: input feature ○ nMeal: consume directly ○ prepTime: needs transformation before use ○ nBusy: input latlong but need regionId Setting up consumption pipelines ○ nMeal: r_id -> nMeal ○ prepTime: r_id -> prepTime -> featureImpute ○ nBusy: r_id -> lat, log -> regionId(lat, log) -> nBusy In arbitrary order
  • 14. Michelangelo Transformers ○ Transformer: given a record defined by a set of fields, add/modify/remove fields in the record ○ PipelineModel: a sequence of transformers ○ Spark ML: Transformer/PipelineModel on DataFrames ○ Michelangelo Transformers: extended transformer framework for both Apache Spark and Spark-less environments ○ Estimator: analyzes the data and produces a transformer
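The three roles can be sketched on plain dict records, with no Spark involved. All class names below are hypothetical, `cuisine` is an invented field, and the index ordering (most frequent first, ties alphabetical) is a simplification chosen for the example rather than a statement about Spark ML or Michelangelo.

```python
from collections import Counter


class StringIndexerModel:
    """Transformer: adds '<field>_idx' to a record using a learned mapping."""

    def __init__(self, field, index):
        self.field = field
        self.index = index

    def transform(self, record):
        out = dict(record)
        # Unseen labels get one extra index past the known ones.
        out[self.field + "_idx"] = self.index.get(record[self.field], len(self.index))
        return out


class StringIndexerEstimator:
    """Estimator: analyzes the data and produces a transformer."""

    def __init__(self, field):
        self.field = field

    def fit(self, records):
        counts = Counter(r[self.field] for r in records)
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        return StringIndexerModel(self.field, {v: i for i, v in enumerate(ordered)})


class DropField:
    """Transformer: removes a field from the record."""

    def __init__(self, field):
        self.field = field

    def transform(self, record):
        return {k: v for k, v in record.items() if k != self.field}


class PipelineModel:
    """A sequence of transformers applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, record):
        for stage in self.stages:
            record = stage.transform(record)
        return record


training = [{"cuisine": "thai"}, {"cuisine": "pizza"}, {"cuisine": "thai"}]
indexer = StringIndexerEstimator("cuisine").fit(training)
model = PipelineModel([indexer, DropField("cuisine")])
scored = model.transform({"cuisine": "pizza", "nOrder": 3})
print(scored)
```

Working on single records rather than DataFrames is what lets the same fitted pipeline run outside Spark, which is the motivation the slide gives for the extended framework.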
  • 15. Defining a Pipeline Model Stages: Join Palette Features -> Apply Feature Eng Rules -> String Indexing -> One-Hot Encoding -> DL Inferencing -> Result Retrieval. Feature consumption ○ Feature extraction: Palette feature retrieval expressed as a transform ○ Feature mutation: Scala-like DSL for simple transforms ○ Model-centric Feature Engineering: string indexer, one-hot encoder, threshold decision ○ Result retrieval. Modeling ○ Model inferencing (also a Michelangelo Transformer)
  • 16. Michelangelo Transformers Example
      class MyModel(override val uid: String)
          extends Model[MyModel] with MyModelParam with MLWritable with MATransformer {
        ...
        // DataFrame-based path (standard Spark ML Model contract)
        override def transform(dataset: Dataset[_]): DataFrame = ...
        // Per-record path, presumably used for Spark-less serving
        override def scoreInstance(instance: util.Map[String, Object]): util.Map[String, Object] = ...
      }

      // Spark ML's Estimator is parameterized by the Model type that fit() returns
      class MyEstimator(override val uid: String)
          extends Estimator[MyModel] with Params with DefaultParamsWritable {
        ...
        override def fit(dataset: Dataset[_]): MyModel = ...
      }
  • 17. Palette retrieval as a Transformer [Diagram labels: Palette Feature Transformer; Feature Meta Store; RPC Feature Proxy; Cassandra Access; Hive Access]
      tx_p1 = PaletteTransformer([
          "@palette:restaurant:realtime_feature:nMeal:r_id",
          "@palette:restaurant:batch_feature:prepTime:r_id",
          "@palette:restaurant:property:lat:r_id",
          "@palette:restaurant:property:log:r_id"
      ])
      tx_p2 = PaletteTransformer([
          "@palette:region:service_feature:nBusy:region_id"
      ])
  • 18. DSL Estimator / Transformer [Diagram labels: DSL Estimator; Code Gen / Compiler; DSL Transformer; Online classloader; Offline classloader]
      # Each lambda is an [output_field, DSL expression] pair
      es_dsl1 = DSLEstimator(lambdas=[
          ["region_id", "regionId(@palette:restaurant:property:lat:r_id, @palette:restaurant:property:log:r_id)"]
      ])
      es_dsl2 = DSLEstimator(lambdas=[
          ["prepTime", 'nFill(nVal("@palette:restaurant:batch_feature:prepTime:r_id"), avg("@palette:restaurant:batch_feature:prepTime:r_id"))'],
          ["nMeal", 'nVal("@palette:restaurant:realtime_feature:nMeal:r_id")'],
          ["nOrder", 'nVal("@basis:nOrder")'],
          ["nBusy", 'nVal("@palette:region:service_feature:nBusy:region_id")']
      ])
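One plausible reading of `nFill(nVal(x), avg(x))` is mean imputation: the average is computed once over the training data when the estimator is fitted, then frozen into the transformer so that serving fills missing values with the same number. The sketch below assumes those semantics; `n_val`, `n_fill` and `fit_fill_lambda` are hypothetical Python stand-ins, not the DSL's actual functions.

```python
from statistics import mean

SOURCE = "@palette:restaurant:batch_feature:prepTime:r_id"


def n_val(record, name):
    """Read a numeric field; None when it is missing (assumed nVal semantics)."""
    value = record.get(name)
    return None if value is None else float(value)


def n_fill(value, default):
    """Use `default` when `value` is missing (assumed nFill semantics)."""
    return default if value is None else value


def fit_fill_lambda(training_records, source, output):
    """Estimator step: compute avg(source) once, return a per-record transform."""
    observed = [n_val(r, source) for r in training_records]
    avg = mean(v for v in observed if v is not None)

    def transform(record):
        # Transformer step: the frozen average is reused at training and serving time.
        return {**record, output: n_fill(n_val(record, source), avg)}

    return transform


training = [{SOURCE: 20}, {SOURCE: 10}, {SOURCE: None}]
fill_prep_time = fit_fill_lambda(training, SOURCE, "prepTime")
print(fill_prep_time({SOURCE: None})["prepTime"])
print(fill_prep_time({SOURCE: 30})["prepTime"])
```

Separating the fit-time statistic from the per-record transform mirrors the Estimator-to-Transformer step on the slide, where the generated transformer is loaded by both the online and offline classloaders.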
  • 19. Uber Eats Example Cont. Computation order ○ nMeal: rId -> nMeal ○ prepTime: rId -> prepTime -> featureImpute ○ nBusy: rId -> lat, log -> regionId(lat, log) -> nBusy. Transformer chain: Palette Transformer (id -> nMeal; id -> prepTime; id -> lat, log) -> DSL Transformer (lat, log -> regionId) -> Palette Transformer (regionId -> nBusy) -> DSL Transformer (impute(nMeal), impute(prepTime))
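The ordering constraint is that nBusy can only be looked up after regionId has been derived from lat and log. A toy version of the four-stage chain, with dicts standing in for the feature stores, makes the dependency explicit. Everything here is assumed for illustration: the table contents, the 10-degree grid in `region_id_stage`, and the imputation defaults.

```python
# Hypothetical feature tables keyed by restaurant id and region id.
RESTAURANT_FEATURES = {
    "uuid1": {"nMeal": 12, "prepTime": None, "lat": 10.0, "log": 20.0},
}
REGION_FEATURES = {"r_1_2": {"nBusy": 0.7}}


def palette_lookup(table, key_field, names):
    """Stage that joins the named features onto the record by key."""
    def stage(record):
        row = table.get(record[key_field], {})
        return {**record, **{name: row.get(name) for name in names}}
    return stage


def region_id_stage(record):
    """Stage that derives region_id from lat/log (made-up 10-degree grid)."""
    cell = f"r_{int(record['lat'] // 10)}_{int(record['log'] // 10)}"
    return {**record, "region_id": cell}


def impute(name, default):
    """Stage that replaces a missing value with a fixed default."""
    def stage(record):
        value = record[name]
        return {**record, name: default if value is None else value}
    return stage


STAGES = [
    palette_lookup(RESTAURANT_FEATURES, "r_id", ["nMeal", "prepTime", "lat", "log"]),
    region_id_stage,                                   # lat, log -> region_id
    palette_lookup(REGION_FEATURES, "region_id", ["nBusy"]),
    impute("nMeal", 0),
    impute("prepTime", 15.0),
]


def run(record):
    for stage in STAGES:
        record = stage(record)
    return record


vector = run({"r_id": "uuid1", "nOrder": 4})
print(vector)
```

Swapping the second and third stages would fail with a missing `region_id`, which is why the chain alternates between Palette and DSL transformers instead of doing all lookups first.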
  • 20. Dev Tools: Authoring and Debugging a Pipeline ○ Palette feature generation: Apache Hive QL, Apache Flink SQL ○ Interactive authoring: PySpark + IPython/Jupyter notebook ○ Centralized model store: serialization/deserialization (Spark ML, MLReadable/MLWritable); online and offline accessibility
      basis_feature_sql = "..."
      df = spark.sql(basis_feature_sql)
      # vec_asm and l_r are not defined on the slide (presumably a vector assembler and the learner)
      pipeline = Pipeline(stages=[tx_p1, es_dsl1, tx_p2, es_dsl2, vec_asm, l_r])
      pipeline_model = pipeline.fit(df)
      scored_df = pipeline_model.transform(df)
      # Persist the fitted model and the pipeline definition, then retrain on new basis data
      model_id = MA_store.save_model(pipeline_model)
      draft_id = MA_store.save_pipeline(basis_feature_sql, pipeline)
      retrain_job = MA_API.train(draft_id, new_basis_feature_sql)
  • 21. Takeaways ○ Feature Store: batch, real-time and external features with online and offline parity ○ Offline scalability: joins across billions of rows ○ Online serving latency: parallel IO, fast storage with caching ○ Feature Transformers: set up chains of transformations at training/serving time ○ Pipeline reliability and monitoring out-of-the-box