DoorDash is a 3-sided marketplace that consists of Merchants, Consumers, and Dashers.
As DoorDash's business grows, online ML prediction volume grows exponentially to support the various machine learning use cases, such as ETA predictions, Dasher assignment, personalized restaurant and menu-item recommendations, and ranking for a large volume of search queries.
The prediction service built to meet the above use cases now supports many dozens of models spanning different machine learning algorithms, such as gradient boosting, neural networks, and rule-based models. The service serves more than 10 billion predictions every day, with a peak rate above 1 million predictions per second.
In this session, we will share our journey of building and scaling our machine learning platform, and particularly the prediction service: the optimizations we experimented with, the lessons we learned, and the technical decisions and tradeoffs we made. We will also share how we measure success and how we set goals for the future. Finally, we will end by highlighting the challenges ahead in extending our machine learning platform to support the data scientist community and a wider set of use cases at DoorDash.
19. Machine Learning Platform Journey
ML Platform Pillars:
- Feature Engineering
- Model Training
- Model Prediction
- Model Management
- ML Insights

Think big, but start small.
20. Machine Learning Platform Journey
[Diagram: Feature Engineering — real-time features from production systems and historical features from feature engineering jobs flow through the Feature Service into the Feature Store (Redis), which serves the Online Prediction Service.]
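The feature-lookup side of this architecture can be sketched as a single batched read against the Redis feature store. The key scheme, feature names, and client interface below are assumptions for illustration; any client exposing Redis's `mget` (such as `redis.Redis`) would fit.

```python
import json

def fetch_features(store, entity_id, feature_names):
    """Fetch the requested features for one entity in a single batched read.

    `store` is any client exposing Redis's mget(); the key scheme
    "feature:<entity>:<name>" is hypothetical.
    """
    keys = [f"feature:{entity_id}:{name}" for name in feature_names]
    values = store.mget(keys)  # one round trip keeps tail latency low
    return {name: (json.loads(v) if v is not None else None)
            for name, v in zip(feature_names, values)}
```

Batching all keys into one `mget` call matters at this request volume: one round trip per prediction instead of one per feature.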
21. Machine Learning Platform Journey
[Diagram: Model Training & Management — Python training jobs feed the Model Training Service, which publishes models to the Model Store; the Online Prediction Service loads models from the Model Store.]
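The publish/load contract between training and serving can be sketched with a toy model store. The file-tree layout and pickle serialization are illustrative stand-ins for whatever blob storage and format back the real store.

```python
import pickle
from pathlib import Path

class ModelStore:
    """Toy model store: training publishes a model under a name/version key,
    and the prediction service loads it by the same key. A local file tree
    stands in for the real storage backend."""

    def __init__(self, root):
        self.root = Path(root)

    def publish(self, name, version, model):
        path = self.root / name / f"{version}.pkl"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(model))

    def load(self, name, version):
        return pickle.loads((self.root / name / f"{version}.pkl").read_bytes())
```

Versioned keys are what make safe rollouts possible: the prediction service can pin a known-good version while a new one is validated.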
22. Machine Learning Platform Journey
[Diagram: Prediction flow — prediction requests reach the Online Prediction Service, which loads models from the Model Store and fetches features from the Feature Store (Redis); model prediction produces prediction results.]
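The request flow on this slide — load the model, fetch features, predict, record the result — can be sketched end to end. Plain dicts and a callable model stand in for the real model store, feature store, and results sink.

```python
class PredictionService:
    """Toy version of the online prediction flow. Dicts stand in for the
    model store and feature store; a list stands in for the results sink."""

    def __init__(self, model_store, feature_store, result_sink):
        self.models = model_store      # model_id -> callable model
        self.features = feature_store  # entity_id -> feature dict
        self.results = result_sink     # collected prediction results

    def predict(self, model_id, entity_id):
        model = self.models[model_id]     # load model
        feats = self.features[entity_id]  # fetch features
        score = model(feats)              # model prediction
        self.results.append((model_id, entity_id, score))  # prediction results
        return score
```

Keeping the four steps behind one interface is what lets the service stay agnostic to the model type (gradient boosting, neural network, or rule-based).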
30. Users
DSML Microservices
- higher request throughput
- more uptime
- more models
- new features
- more experiments
31. Four phases of scaling
I. Happy Horizontal Scaling
II. Uh-oh Infrastructure
III. Optimize to survive
IV. User Isolation
35. Use Case Latency
Dasher Dispatch 133.194 ms
delivery-predictors 103.838 ms
Feed ranking 33.146 ms
Item Recommendation 28.631 ms
ETA prediction 11.889 ms
Kitchen capacity 2.163 ms
43. Hitting the limits
[Diagram: peak predictions/sec — multiple Prediction Microservices, each with its own Feature Store, sharing the Model Store, Prediction Store, and Metrics pipeline.]
45. Hitting the limits
[Diagram: repeats the slide 43 architecture — Prediction Microservices with per-instance Feature Stores and the shared Model Store, Prediction Store, and Metrics pipeline.]
51. Stifled infrastructure
- Splunk quota exceeded
- Wavefront metrics limit breached
- Blocked by Segment for sending “too many” events
- High Service Discovery (Consul) CPU threatening a total outage
52. Stifled infrastructure
- Splunk quota exceeded → keep only essential, sampled logging
- Wavefront metrics limit breached → move from statsd to Prometheus
- Blocked by Segment for sending “too many” events → use in-house Kafka streaming instead
- High Service Discovery (Consul) CPU threatening a total outage → run beefier pods to reduce the number of discoverable pods
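The first remedy, essential-and-sampled logging, can be as simple as a probabilistic gate in front of the logger. The 1% default rate and logger name below are illustrative, not the actual configuration.

```python
import logging
import random

log = logging.getLogger("prediction")

def sampled_debug(message, rate=0.01, rng=random.random):
    """Emit roughly `rate` of debug lines; return whether this one was logged.

    Errors and other essential lines should bypass this gate and always log.
    """
    if rng() < rate:
        log.debug(message)
        return True
    return False
```

At 1M+ predictions per second, even a 1% sample still produces thousands of log lines per second, which is why sampling has to pair with cutting non-essential lines entirely.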
54. Summary
• Isolate use cases wherever possible
• Scaling out will eventually either bust budgets or stop helping
• Write down infrastructure dependencies and their implications
55. Lessons Learned
• Design for the happy path and the less happy path
• Customer obsession
• Big vision, but build incrementally
56. Future Work
● More microservice optimizations
● Generalized model serving
○ NLP & Image recognition
● Unified prediction client