Saba Khalilnaji, DoorDash, Software Engineer + Ashwin Kachhara, DoorDash, Software Engineer
Scaling backend infrastructure to handle hyper-growth is one of the many exciting challenges of working at DoorDash. In this talk, we’ll discuss some scaling issues in 2019 that prompted us to accelerate our adoption of Kafka.
In mid-2019, we faced significant scaling challenges and frequent outages involving Celery and RabbitMQ, the two technologies powering the system that handles the asynchronous work behind critical functionality of our platform, including order checkout and Dasher assignments. We quickly solved this problem with a simple, Apache Kafka-based asynchronous task processing system that stopped our outages while we continued to iterate on a robust solution. Our initial version implemented the smallest set of features needed to accommodate a large portion of existing Celery tasks. Once in production, we continued to add support for more Celery features while addressing novel problems that arose when using Kafka.
Thereafter, we adopted Kafka across a variety of domains either directly, or in conjunction with technologies like Flink and Cadence. Kafka’s ability to scale and provide at-least-once message delivery has been crucial for our use cases and given us a boost in reliability across several domains.
https://www.meetup.com/KafkaBayArea/events/274915506/?isFirstPublish=true
2. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash
Contents
Introduction
Problems we faced with Celery / RabbitMQ
Potential solutions to problems with Celery / RabbitMQ
Kafka Onboarding Strategy
No solution is perfect
Key Wins
Other use-cases of Kafka at DoorDash
Conclusion
Acknowledgements
Introduction
Tasks related to different use-cases leverage different topics with their dedicated worker pools, based on volume.
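The topic-per-use-case layout described above can be pictured as a small routing table. This is only an illustrative sketch: the use-case names echo the checkout and Dasher-assignment examples from the abstract, but the topic names, pool sizes, and the `route_for` helper are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRoute:
    topic: str    # Kafka topic that carries this use-case's tasks (hypothetical name)
    workers: int  # size of the dedicated worker pool (hypothetical number)


# Higher-volume use-cases get their own topic and a larger dedicated pool.
ROUTES = {
    "checkout": TaskRoute(topic="tasks.checkout", workers=32),
    "dasher_assignment": TaskRoute(topic="tasks.dasher-assignment", workers=16),
    "notifications": TaskRoute(topic="tasks.notifications", workers=4),
}

# Low-volume or unlisted use-cases share a small default pool.
DEFAULT_ROUTE = TaskRoute(topic="tasks.default", workers=2)


def route_for(use_case: str) -> TaskRoute:
    """Return the topic and worker-pool size assigned to a use-case."""
    return ROUTES.get(use_case, DEFAULT_ROUTE)
```

Keeping the mapping in one place means a noisy use-case can be moved to its own topic and pool without touching the task code.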
Issues with availability
● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA
● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput
● Our uWSGI workers’ harakiri setting caused connection churn to RabbitMQ and cascading failures
● Celery task processing would stop with no evidence of resource constraints, requiring a restart
Other problems with Celery and RabbitMQ
SCALABILITY
Reached the maximum vertical scale available to us. The provider HA mode limited our capacity.
OBSERVABILITY
Limited to a small set of RabbitMQ metrics available to us. Limited visibility into the Celery workers.
OPERATIONAL EFFICIENCY
Unsustainable time spent operating and maintaining RabbitMQ. Not enough in-house RabbitMQ expertise.
Potential solutions we considered
CELERY BROKER CHANGE
Continue using Celery with a potentially more reliable backing data store.
MULTI-BROKER SYSTEM
Shard task processing across multiple brokers to reduce average load.
RMQ / CELERY VERSION UPGRADE
Leverage potential reliability fixes in newer versions, buying us some time.
CUSTOM KAFKA SOLUTION
More effort than any other solution, but potential to solve all our problems (by design).
Option #1
Change the Celery Broker to Redis
PROS
● Improved availability & observability w/ ECC & multi-AZ
● Improved operational efficiency
● In-house operational experience & expertise w/ Redis
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Redis performance
CONS
● Incompatible w/ Redis clustered mode
● Single node Redis does not scale horizontally
● No Celery observability improvements
● Does not address stopped worker problem
Does not solve scalability, only partially solves observability, and does not address the stopped worker problem
Option #2
Change the Celery Broker to Kafka
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● The team has lots of Kafka expertise
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Kafka performance
CONS
● Kafka is not supported by Celery yet
● No Celery observability improvements
● Does not address stopped worker problem
● Insufficient experience operating Kafka at scale
Only partially solves observability, does not address the stopped worker problem, and is not supported out of the box
Option #3
Multi-Broker Solution
PROS
● Improved availability
● Horizontal scalability
● Comparatively less effort required
CONS
● No observability or operational efficiency boosts
● Does not address stopped worker problem
● Does not address connection churn issue
Does not solve observability, connection churn, or the stopped worker problem
Option #4
Upgrade both Celery & RabbitMQ versions
PROS
● Might prevent RabbitMQ getting stuck
● Might prevent Celery workers getting stuck
● Buys us time to work on a longer-term strategy
CONS
● Will not fix any issues immediately
● Requires newer versions of Python
● Does not address connection churn issue
Might prevent stuck Celery workers, but does not definitively solve anything else
Option #5
Building a custom Kafka solution
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● Team has a lot of in-house Kafka expertise
● Broker change is a straightforward option
● Connection churn doesn’t degrade Kafka performance
● Addresses stopped worker problem
CONS
● More work to implement compared to other options
● Minimal team experience operating Kafka at scale
Solves all our problems, but requires the most effort, and we had limited experience operating Kafka at scale
Building a custom Kafka Solution!
It addressed all the problems we were facing, while also being an industry standard that can scale. Kafka would give us full control over observability and availability.
Kafka Onboarding Strategy
HITTING THE GROUND RUNNING
Leverage the basic solution as we’re iterating on other parts of it. “Racing a car while swapping in a new fuel pump”
NO-OP ADOPTION
Maintain the same task interface for seamless, no-hassle adoption and minimize effort on the part of developers
INCREMENTAL ROLLOUT, ZERO DOWNTIME
Instead of a big flashy release, ship smaller independent features that can be individually tested
ONBOARDING STRATEGY
Hitting the ground running
We built a minimum viable product (MVP) to bring us interim stability and buy us time to iterate on a more comprehensive solution.
ONBOARDING STRATEGY
Hitting the ground running
We launched our MVP after 2 weeks of development. We achieved an 80% reduction in RabbitMQ task load a week after that.
ONBOARDING STRATEGY
Seamless adoption, incremental rollout
● We implemented a wrapper for Celery’s @task annotation
● This allowed us to route task submissions to either system dynamically
● As soon as a Celery subfeature had been ported, tasks using it could be migrated (in seconds)
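The wrapper idea above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual DoorDash wrapper: `routed_task`, `KAFKA_ENABLED`, `submit_to_celery`, and `submit_to_kafka` are hypothetical names, and the two submit functions are stand-ins that only record where a task would have been sent rather than calling Celery or a Kafka client.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical runtime flags: task name -> True once the task is routed to Kafka.
KAFKA_ENABLED: Dict[str, bool] = {}


def submit_to_celery(name: str, args: Tuple, kwargs: Dict[str, Any]) -> Tuple:
    """Stand-in for enqueueing through the existing Celery/RabbitMQ path."""
    return ("celery", name, args, kwargs)


def submit_to_kafka(name: str, args: Tuple, kwargs: Dict[str, Any]) -> Tuple:
    """Stand-in for producing a task message to a Kafka topic."""
    return ("kafka", name, args, kwargs)


def routed_task(func: Callable) -> Callable:
    """Hypothetical replacement for a task decorator.

    Callers keep using `.delay(...)`, so the task interface does not change;
    the backend is chosen per call from the runtime flag.
    """
    name = func.__name__

    def delay(*args: Any, **kwargs: Any) -> Tuple:
        if KAFKA_ENABLED.get(name, False):
            return submit_to_kafka(name, args, kwargs)
        return submit_to_celery(name, args, kwargs)

    func.delay = delay  # attach the submission entry point to the task
    return func


@routed_task
def send_receipt(order_id: int) -> str:
    return f"receipt for {order_id}"


before = send_receipt.delay(42)         # flag unset: goes to the Celery stand-in
KAFKA_ENABLED["send_receipt"] = True    # flip the flag at runtime
after = send_receipt.delay(42)          # same call site, now goes to the Kafka stand-in
```

Because the decision is made at call time, flipping a flag moves a task between systems without a deploy, and flipping it back serves as a rollback.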
NO SOLUTION IS PERFECT
Head-of-the-line blocking
A “slow” message in a partition can block all messages behind it from getting processed.
NO SOLUTION IS PERFECT
Non-blocking task consumer
Consists of
● 1 x Local message queue
● 1 x Kafka-consumer process
● N x Task-executor processes
A “slow” message only blocks a single task-executor process till it completes. Other messages in the partition can continue to flow.
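The consumer layout above (one consumer feeding a local queue that N executors drain) can be sketched with the standard library alone. This is a simplified illustration, not the production design: it uses threads instead of separate processes, a plain list stands in for messages polled from a Kafka partition, and `run_consumer` and `handle` are hypothetical names.

```python
import queue
import threading
import time


def run_consumer(messages, handler, num_executors=2, queue_size=10):
    """Feed messages through a bounded local queue to a pool of executors.

    Returns the messages in the order their handlers finished.
    """
    local_queue = queue.Queue(maxsize=queue_size)
    completed = []
    lock = threading.Lock()
    stop = object()  # sentinel that tells an executor to exit

    def executor():
        while True:
            msg = local_queue.get()
            if msg is stop:
                break
            handler(msg)  # a slow handler ties up only this executor
            with lock:
                completed.append(msg)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for worker in workers:
        worker.start()

    # Stand-in for the Kafka-consumer loop: hand off each message right away
    # instead of processing it inline, so the partition keeps flowing.
    for msg in messages:
        local_queue.put(msg)

    for _ in workers:
        local_queue.put(stop)
    for worker in workers:
        worker.join()
    return completed


def handle(msg):
    if msg == "slow":
        time.sleep(0.5)  # simulate one slow task


order = run_consumer(["slow", "fast-1", "fast-2", "fast-3"], handle)
```

With two executors, the fast messages are expected to finish while the slow one is still running, which is the behavior the slide describes. A real consumer would also need to decide when to commit offsets, since messages can now complete out of order; that concern is outside this sketch.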
Scheduled tasks (and more) via Cadence
● Kafka is not a hard dependency for Cadence
● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
● Distributed, scalable, durable, and highly available
● Orchestrates asynchronous business logic scalably and resiliently
Conclusion & Key Wins
NO MORE REPEATED OUTAGES
Dealt with the outage problem within 3 weeks of development, giving us more time after that to focus on esoteric features.
PROCESSING NO LONGER A BOTTLENECK
Task processing was no longer a bottleneck, allowing DoorDash to continue growing and serving customers.
10x INCREASED OBSERVABILITY
Granular observability in prod and dev environments, improving confidence as well as developer productivity.
OPERATIONAL DECENTRALIZATION
Enable developers to debug their operational issues and perform cluster-management ops if needed.
OTHER USE-CASES
Real-Time Streaming Platform
Receive real-time production and analytics events
Kafka REST Proxy
Apache Flink
Current Scale
● 800B events / day
● Peak > 200k / sec
OTHER USE-CASES
Our Iguazu Pipeline
Standardized events with schema definitions as Protobuf or Avro
● Low latency
● Lower costs
● Better Data Quality
OTHER USE-CASES
Search Indexing
Huge boost in
● Indexing speed
● Accuracy
It takes a village!
Engineering Branding:
Ezra Berger
Wayne Cunningham
Engineering:
Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger,
Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir
Further Reading
● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/