DoorDash is a 3-sided marketplace that consists of Merchants, Consumers, and Dashers.
As DoorDash's business grows, online ML prediction volume grows exponentially to support the various machine learning use cases, such as ETA predictions, Dasher assignment, personalized restaurant and menu-item recommendations, and ranking for a large volume of search queries.
The prediction service built to meet the above use cases now supports many dozens of models spanning different machine learning algorithms, such as gradient boosting, neural networks, and rule-based models. The service serves more than 10 billion predictions every day, with a peak rate above 1 million predictions per second.
In this session, we will share our journey of building and scaling our machine learning platform, and particularly the prediction service: the optimizations we experimented with, the lessons we learned, and the technical decisions and tradeoffs we made. We will also share how we measure success and how we set goals for the future. Finally, we will end by highlighting the challenges ahead in extending our machine learning platform to support the data scientist community and a wider set of use cases at DoorDash.
19. Machine Learning Platform Journey
ML Platform Pillars:
- Feature Engineering
- Model Training
- Model Prediction
- Model Management
- ML Insights

Think big, but start small.
20. Machine Learning Platform Journey
[Diagram: Feature Engineering — real-time features from production systems and historical features from feature engineering jobs flow through the Feature Service into the Feature Store (Redis), which serves the Online Prediction Service.]
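The feature-lookup side of this architecture can be sketched as a single batched read against the Redis feature store. The key scheme, feature names, and client interface below are assumptions for illustration; any client exposing Redis's `mget` (such as `redis.Redis`) would fit.

```python
import json

def fetch_features(store, entity_id, feature_names):
    """Fetch the requested features for one entity in a single batched read.

    `store` is any client exposing Redis's mget(); the key scheme
    "feature:<entity>:<name>" is hypothetical.
    """
    keys = [f"feature:{entity_id}:{name}" for name in feature_names]
    values = store.mget(keys)  # one round trip keeps tail latency low
    return {name: (json.loads(v) if v is not None else None)
            for name, v in zip(feature_names, values)}
```

Batching all keys into one `mget` call matters at this request volume: one round trip per prediction instead of one per feature.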
21. Machine Learning Platform Journey
[Diagram: Model Training & Management — Python training jobs feed the Model Training Service, which publishes models to the Model Store; the Online Prediction Service loads models from the Model Store.]
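The publish/load contract between training and serving can be sketched with a toy model store. The file-tree layout and pickle serialization are illustrative stand-ins for whatever blob storage and format back the real store.

```python
import pickle
from pathlib import Path

class ModelStore:
    """Toy model store: training publishes a model under a name/version key,
    and the prediction service loads it by the same key. A local file tree
    stands in for the real storage backend."""

    def __init__(self, root):
        self.root = Path(root)

    def publish(self, name, version, model):
        path = self.root / name / f"{version}.pkl"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(model))

    def load(self, name, version):
        return pickle.loads((self.root / name / f"{version}.pkl").read_bytes())
```

Versioned keys are what make safe rollouts possible: the prediction service can pin a known-good version while a new one is validated.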
22. Machine Learning Platform Journey
[Diagram: Prediction flow — prediction requests reach the Online Prediction Service, which loads models from the Model Store and fetches features from the Feature Store (Redis); model prediction produces prediction results.]
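The request flow on this slide — load the model, fetch features, predict, record the result — can be sketched end to end. Plain dicts and a callable model stand in for the real model store, feature store, and results sink.

```python
class PredictionService:
    """Toy version of the online prediction flow. Dicts stand in for the
    model store and feature store; a list stands in for the results sink."""

    def __init__(self, model_store, feature_store, result_sink):
        self.models = model_store      # model_id -> callable model
        self.features = feature_store  # entity_id -> feature dict
        self.results = result_sink     # collected prediction results

    def predict(self, model_id, entity_id):
        model = self.models[model_id]     # load model
        feats = self.features[entity_id]  # fetch features
        score = model(feats)              # model prediction
        self.results.append((model_id, entity_id, score))  # prediction results
        return score
```

Keeping the four steps behind one interface is what lets the service stay agnostic to the model type (gradient boosting, neural network, or rule-based).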
30. Users
DSML Microservices
- higher request throughput
- more uptime
- more models
- new features
- more experiments
31. Four phases of scaling
I. Happy Horizontal Scaling
II. Uh-oh Infrastructure
III. Optimize to survive
IV. User Isolation
35. Use Case Latency
Dasher Dispatch 133.194 ms
delivery-predictors 103.838 ms
Feed ranking 33.146 ms
Item Recommendation 28.631 ms
ETA prediction 11.889 ms
Kitchen capacity 2.163 ms
43. Hitting the limits
[Diagram: peak predictions/sec — multiple Prediction Microservices, each with its own Feature Store, sharing the Model Store, Prediction Store, and Metrics pipeline.]
45. Hitting the limits
[Diagram: repeats the slide 43 architecture — Prediction Microservices with per-instance Feature Stores and the shared Model Store, Prediction Store, and Metrics pipeline.]
51. Stifled infrastructure
- Splunk quota exceeded
- Wavefront metrics limit breached
- Blocked by Segment for sending “too many” events
- High Service Discovery (Consul) CPU threatening a total outage
52. Stifled infrastructure
- Splunk quota exceeded → keep only essential, sampled logging
- Wavefront metrics limit breached → move from statsd to Prometheus
- Blocked by Segment for sending “too many” events → use in-house Kafka streaming instead
- High Service Discovery (Consul) CPU threatening a total outage → run beefier pods to reduce the number of discoverable pods
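The first remedy, essential-and-sampled logging, can be as simple as a probabilistic gate in front of the logger. The 1% default rate and logger name below are illustrative, not the actual configuration.

```python
import logging
import random

log = logging.getLogger("prediction")

def sampled_debug(message, rate=0.01, rng=random.random):
    """Emit roughly `rate` of debug lines; return whether this one was logged.

    Errors and other essential lines should bypass this gate and always log.
    """
    if rng() < rate:
        log.debug(message)
        return True
    return False
```

At 1M+ predictions per second, even a 1% sample still produces thousands of log lines per second, which is why sampling has to pair with cutting non-essential lines entirely.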
54. Summary
• Isolate use cases wherever possible
• Scaling out will eventually either bust budgets or stop helping
• Write down infrastructure dependencies and their implications
55. Lessons Learned
• Design for the happy path and the less happy path
• Customer obsession
• Big vision, but build incrementally
56. Future Work
● More microservice optimizations
● Generalized model serving
○ NLP & Image recognition
● Unified prediction client