SABA KHALILNAJI saba@doordash.com
ASHWIN KACHHARA ashwin@doordash.com
12/15/2020
Using Kafka to Replace RabbitMQ
and Eliminate Task Processing
Outages at DoorDash
Contents
Introduction
Problems we faced with Celery / RabbitMQ
Potential solutions to problems with Celery / RabbitMQ
Kafka Onboarding Strategy
No solution is perfect
Key Wins
Other use-cases of Kafka at DoorDash
Conclusion
Acknowledgements
Introduction
Tasks related to different use-cases leverage different topics with their
dedicated worker pools, based on volume.
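As a rough illustration of the layout described on this slide, the sketch below maps each use case to its own topic and a worker pool sized by volume. The use-case names, topic names, and pool sizes are all hypothetical; the slides do not give the actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicConfig:
    topic: str         # Kafka topic that carries this use case's tasks
    worker_count: int  # size of the dedicated worker pool for that topic


# Hypothetical use cases, topic names, and pool sizes, for illustration only.
TOPICS = {
    "order_notifications": TopicConfig("tasks.order_notifications", 32),
    "menu_updates": TopicConfig("tasks.menu_updates", 8),
    "reports": TopicConfig("tasks.reports", 2),
}


def route(use_case: str) -> TopicConfig:
    """Return the topic and pool size a task for this use case should use."""
    try:
        return TOPICS[use_case]
    except KeyError:
        raise ValueError(f"no topic configured for use case {use_case!r}") from None
```

Keeping high-volume and low-volume use cases on separate topics means a backlog in one does not delay the others.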
Problems we faced with
RabbitMQ & Celery
Issues with availability
● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA
● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput
● Our uWSGI workers’ harakiri setting caused connection churn to RabbitMQ and cascading failures
● Celery task processing would stop with no evidence of resource constraints, requiring a restart
Other problems with Celery and RabbitMQ
SCALABILITY
Reached the maximum vertical
scale available to us. The provider
HA mode limited our capacity.
OBSERVABILITY
Limited to a small set of RabbitMQ
metrics available to us. Limited
visibility into the Celery workers.
OPERATIONAL EFFICIENCY
Unsustainable time spent operating
and maintaining RabbitMQ. Not enough
in-house RabbitMQ expertise.
Potential Solutions to the problems
with RabbitMQ and Celery
Potential solutions we considered
CELERY BROKER CHANGE
Continue using Celery with a potentially more
reliable backing data store.
MULTI-BROKER SYSTEM
Shard task processing across multiple
brokers to reduce average load.
RMQ / CELERY VERSION UPGRADE
Leverage potential reliability fixes in newer
versions, buying us some time.
CUSTOM KAFKA SOLUTION
More effort than any other solution, but potential
to solve all our problems (by design).
Option #1
Change the Celery Broker to Redis
PROS
● Improved availability & observability w/ ECC & multi-AZ
● Improved operational efficiency
● In-house operational experience & expertise w/ Redis
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Redis performance
CONS
● Incompatible w/ Redis clustered mode
● Single node Redis does not scale horizontally
● No Celery observability improvements
● Does not address stopped worker problem
Does not solve scalability, only partially solves observability, and does not address the stopped worker problem
Option #2
Change the Celery Broker to Kafka
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● The team has lots of Kafka expertise
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Kafka performance
CONS
● Kafka is not supported by Celery yet
● No Celery observability improvements
● Does not address stopped worker problem
● Insufficient experience operating Kafka at scale
Only partially solves observability, does not address the stopped worker problem, and is not supported out of the box
Option #3
Multi-Broker Solution
PROS
● Improved availability
● Horizontal scalability
● Comparatively less effort required
CONS
● No observability or operational efficiency boosts
● Does not address stopped worker problem
● Does not address connection churn issue
Does not solve observability, connection churn, or the stopped worker problem
Option #4
Upgrade both Celery & RabbitMQ versions
PROS
● Might prevent RabbitMQ getting stuck
● Might prevent Celery workers getting stuck
● Buys us time to work on a longer-term strategy
CONS
● Will not fix any issues immediately
● Requires newer versions of Python
● Does not address connection churn issue
Might prevent stuck Celery workers, but does not definitively solve anything else
Option #5
Building a custom Kafka solution
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● Team has a lot of in-house Kafka expertise
● Broker change is a straightforward option
● Connection churn doesn’t degrade Kafka performance
● Addresses stopped worker problem
CONS
● More work to implement compared to other options
● Minimal team experience operating Kafka at scale
Solves all our problems, but requires the most effort, and we had limited experience operating Kafka at scale
And the winner is…
Building a custom Kafka Solution!
It addressed all the problems we were facing, while also being an industry standard
that can scale. Kafka would give us full control over observability and availability.
Kafka Onboarding
Strategy
HITTING THE GROUND RUNNING
Leverage the basic solution as we’re
iterating on other parts of it. “Racing a
car while swapping in a new fuel pump”
NO-OP ADOPTION
Maintain the same task interface for
seamless, no-hassle adoption and
minimize effort on the part of developers
INCREMENTAL ROLLOUT, ZERO DOWNTIME
Instead of a big flashy release, ship
smaller independent features that can
be individually tested
Hitting the ground running
We built a minimum viable product (MVP) to
bring us interim stability and buy us time to
iterate on a more comprehensive solution.
We launched our MVP after 2 weeks of
development. We achieved an 80% reduction
in RabbitMQ task load a week after that.
Seamless adoption, incremental rollout
● We implemented a wrapper for Celery’s @task decorator
● The wrapper allowed us to route task submissions to either system dynamically
● As soon as a Celery subfeature had been ported, tasks using it could be migrated (in seconds)
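A minimal sketch of how such a wrapper might look, assuming a plain-Python decorator and an in-memory routing table. The names `task`, `ROUTES`, `submit_to_celery`, and `submit_to_kafka` are hypothetical stand-ins, not DoorDash's code or Celery's API; the two submit functions only record calls so the example runs without a broker.

```python
import functools

# Hypothetical routing table: task name -> "celery" or "kafka".
# In a real system this would come from a dynamic config source so that
# a task can be switched between backends without a deploy.
ROUTES = {}

# Stand-ins for the two backends; they only record what was submitted.
SUBMITTED = {"celery": [], "kafka": []}


def submit_to_celery(name, args, kwargs):
    SUBMITTED["celery"].append((name, args, kwargs))


def submit_to_kafka(name, args, kwargs):
    SUBMITTED["kafka"].append((name, args, kwargs))


def task(func):
    """Wrap a function so that .delay() submits it to the configured backend."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Calling the task directly still runs it synchronously.
        return func(*args, **kwargs)

    def delay(*args, **kwargs):
        # Look up the route at call time so changes take effect immediately.
        backend = ROUTES.get(name, "celery")
        if backend == "kafka":
            submit_to_kafka(name, args, kwargs)
        else:
            submit_to_celery(name, args, kwargs)

    wrapper.delay = delay
    return wrapper


@task
def send_receipt(order_id):
    return f"receipt sent for order {order_id}"
```

Because the route is looked up on every `delay()` call, setting `ROUTES["send_receipt"] = "kafka"` moves new submissions over without touching any call site, which is the property this slide describes.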
ITERATE AS NEEDED
No solution is perfect
Head-of-the-line blocking
A “slow” message in a partition can
block all messages behind it from
getting processed.
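To make the effect concrete, the sketch below computes when each message finishes if a single consumer handles one partition strictly in order. The durations are made-up numbers in milliseconds.

```python
from itertools import accumulate


def sequential_finish_times(durations_ms):
    """Finish time of each message when one consumer processes them in order."""
    return list(accumulate(durations_ms))


# Hypothetical partition: one slow task followed by two fast ones.
durations_ms = [5000, 100, 100]
finish_ms = sequential_finish_times(durations_ms)
# The fast tasks need only 100 ms of work each, but they cannot start
# until the slow task ahead of them has finished.
print(finish_ms)  # [5000, 5100, 5200]
```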
Non-blocking task consumer
Consists of
● 1 x Local message queue
● 1 x Kafka-consumer process
● N x Task-executor processes
A “slow” message only blocks a single
task-executor process till it completes.
Other messages in the partition can
continue to flow.
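The structure above can be sketched roughly as follows. This is a simplified stand-in, not DoorDash's implementation: threads take the place of the task-executor processes, a plain list takes the place of the Kafka partition, and `run_consumer` and `handler` are hypothetical names. No Kafka client is involved.

```python
import queue
import threading


def run_consumer(messages, handler, num_executors=3):
    """Feed messages from one 'partition' through a local queue to N executors.

    Returns the messages in the order their handlers completed.
    """
    local_queue = queue.Queue()
    completed = []
    lock = threading.Lock()

    def executor():
        while True:
            msg = local_queue.get()
            if msg is None:  # sentinel: no more work for this executor
                break
            handler(msg)     # a slow message only ties up this one executor
            with lock:
                completed.append(msg)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for worker in workers:
        worker.start()

    # Stand-in for the Kafka-consumer: it only hands messages to the local
    # queue and never runs a task itself, so it is never blocked by one.
    for msg in messages:
        local_queue.put(msg)
    for _ in workers:
        local_queue.put(None)
    for worker in workers:
        worker.join()
    return completed
```

With at least two executors, a slow message at the head of the list occupies one executor while the others keep draining the queue, so later messages can finish first. With `num_executors=1` the behavior falls back to head-of-the-line blocking.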
Scheduled tasks (and more) via Cadence
● Kafka is not a hard dependency for Cadence
● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
● Distributed, scalable, durable, and highly available
● Orchestrates asynchronous business logic scalably and with resilience
Conclusions
Conclusion & Key Wins
NO MORE REPEATED OUTAGES
Dealt with the outage problem within 3 weeks
of development, giving us more time after
that to focus on esoteric features.
PROCESSING NO LONGER A BOTTLENECK
Task processing was no longer a bottleneck,
allowing DoorDash to continue growing
and serving customers.
10x INCREASED OBSERVABILITY
Granular observability in prod and dev
environments, improving confidence as well
as developer productivity.
OPERATIONAL DECENTRALIZATION
Enable developers to debug their
operational issues, and perform
cluster-management ops if needed.
Other notable use-cases
of Kafka at DoorDash
Real-Time Streaming Platform
Receive real-time production
and analytics events
Kafka REST Proxy
Apache Flink
Current Scale
● 800B events / day
● Peak > 200k / sec
Our Iguazu Pipeline
Standardized events with schema
definitions as Protobuf or Avro
● Low latency
● Lower costs
● Better Data Quality
Search Indexing
Huge boost in
● Indexing speed
● Accuracy
It takes a village!
Engineering Branding:
Ezra Berger
Wayne Cunningham
Engineering:
Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger,
Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir
SABA KHALILNAJI
ASHWIN KACHHARA
12/15/2020
Thank you
Further Reading
● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/

Weitere ähnliche Inhalte

Mehr von confluent

Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluentconfluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernizationconfluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataconfluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesisconfluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023confluent
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streamsconfluent
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluentconfluent
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performanceconfluent
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Replyconfluent
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Diveconfluent
 
Citi Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid CloudCiti Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid Cloudconfluent
 
Partner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and UpgradePartner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and Upgradeconfluent
 
Confluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKConfluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKconfluent
 
Real-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public SectorReal-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public Sectorconfluent
 

Mehr von confluent (20)

Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Reply
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Citi Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid CloudCiti Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid Cloud
 
Partner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and UpgradePartner Tech Talk Q3: Q&A with PS - Migration and Upgrade
Partner Tech Talk Q3: Q&A with PS - Migration and Upgrade
 
Confluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIKConfluent Partner Tech Talk with QLIK
Confluent Partner Tech Talk with QLIK
 
Real-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public SectorReal-time Streaming for Government and the Public Sector
Real-time Streaming for Government and the Public Sector
 

Kürzlich hochgeladen

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Kürzlich hochgeladen (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Doordash: Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages

  • 1. 1 SABA KHALILNAJI saba@doordash.com ASHWIN KACHHARA ashwin@doordash.com 12/15/2020 Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash
  • 2. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 2 Contents Introduction Problems we faced with Celery / RabbitMQ Potential solutions to problems with Celery / RabbitMQ Kafka Onboarding Strategy No solution is perfect Key Wins Other use-cases of Kafka at DoorDash Conclusion Acknowledgements
  • 3. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 3 Tasks related to different use-cases leverage different topics with their dedicated worker pools, based on volume. Introduction
  • 4. 4 Problems we faced with RabbitMQ & Celery
  • 5. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 5 Issues with availability ● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA ● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput ● Our uWSGI worker’s harakiri setting caused a connection churn to RabbitMQ AND cascading failure ● Celery task processing would stop with no evidence of resource constraints, requiring a restart
  • 6. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 6 Other problems with Celery and RabbitMQ SCALABILITY Reached the maximum vertical scale available to us. The provider HA mode limited our capacity. OBSERVABILITY Limited to a small set of RabbitMQ metrics available to us. Limited visibility into the Celery workers. OPERATIONAL EFFICIENCY Unsustainable time spent operating and maintaining RabbitMQ. Not enough in-house RabbitMQ expertise.
  • 7. 7 Potential Solutions to the problems with RabbitMQ and Celery
  • 8. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 8 CELERY BROKER CHANGE Continue using Celery with a potentially more reliable backing data store. MULTI-BROKER SYSTEM Shard task processing across multiple brokers to reduce average load. RMQ / CELERY VERSION UPGRADE Leverage potential reliability fixes in newer versions, buying us some time. CUSTOM KAFKA SOLUTION More effort than any other solution, but potential to solve all our problems (by design). Potential solutions we considered
  • 9. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 9 Change the Celery Broker to Redis ● Improved availability & observability w/ ECC & multi-AZ ● Improved operational efficiency ● In-house operational experience & expertise w/ Redis ● Broker swap is a simple supported option in Celery ● Connection churn doesn’t degrade Redis performance ● Incompatible w/ Redis clustered mode ● Single node Redis does not scale horizontally ● No Celery observability improvements ● Does not address stopped worker problem CONS Option #1 Does not solve scalability, only partially solves observability, and does not address worker stopped problem
  • 10. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 10 Change the Celery Broker to Kafka ● Kafka can be highly available and horizontally scalable ● Improved observability and operational efficiency ● The team has lots of Kafka expertise ● Broker swap is a simple supported option in Celery ● Connection churn doesn’t degrade Kafka performance ● Kafka is not supported by Celery yet ● No Celery observability improvements ● Does not address stopped worker problem ● Insufficient experience operating Kafka at scale CONS Option #2 Only partially solves observability, does not address worker stopped problem AND not supported out of the box
  • 11. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 11 Multi-Broker Solution ● Improved availability ● Horizontal scalability ● Comparatively less effort required ● No observability or operational efficiency boosts ● Does not address stopped worker problem ● Does not address connection churn issue CONS Option #3 Does not solve observability, connection churn, nor worker stopped problem
  • 12. Option #4: Upgrade both Celery & RabbitMQ versions
    PROS
    ● Might prevent RabbitMQ getting stuck
    ● Might prevent Celery workers getting stuck
    ● Buys us time to work on a longer-term strategy
    CONS
    ● Will not fix any issues immediately
    ● Requires newer versions of Python
    ● Does not address connection churn issue
    Summary: Might prevent stuck Celery workers, but does not definitively solve anything else.
  • 13. Option #5: Building a custom Kafka solution
    PROS
    ● Kafka can be highly available and horizontally scalable
    ● Improved observability and operational efficiency
    ● Team has a lot of in-house Kafka expertise
    ● Broker change is a straightforward option
    ● Connection churn doesn’t degrade Kafka performance
    ● Addresses stopped worker problem
    CONS
    ● More work to implement compared to other options
    ● Minimal team experience operating Kafka at scale
    Summary: Solves all our problems, but requires the most effort, and we had limited experience operating Kafka at scale.
  • 15. Building a custom Kafka Solution!
    It addressed all the problems we were facing, while also being an industry standard that can scale. Kafka would give us full control over observability and availability.
  • 17. Kafka Onboarding Strategy
    ● HITTING THE GROUND RUNNING: Leverage the basic solution as we’re iterating on other parts of it. “Racing a car while swapping in a new fuel pump”
    ● NO-OP ADOPTION: Maintain the same task interface for seamless, no-hassle adoption and minimal effort on the part of developers.
    ● INCREMENTAL ROLLOUT, ZERO DOWNTIME: Instead of a big flashy release, ship smaller independent features that can be individually tested.
  • 18. ONBOARDING STRATEGY: Hitting the ground running
    We built a minimum viable product (MVP) to bring us interim stability and buy us time to iterate on a more comprehensive solution.
  • 19. ONBOARDING STRATEGY: Hitting the ground running
    We launched our MVP after 2 weeks of development. We achieved an 80% reduction in RabbitMQ task load a week after that.
  • 20. ONBOARDING STRATEGY: Seamless adoption, incremental rollout
    ● We implemented a wrapper for Celery’s @task annotation
    ● The wrapper allowed us to route task submissions to either system dynamically
    ● As soon as a subfeature of Celery had been ported, tasks using it could be migrated (in seconds)
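    The routing wrapper described on this slide could look roughly like the sketch below. It is an illustration only, not DoorDash's implementation: the `task` decorator, the `ROUTES` table, and the two in-memory lists standing in for the RabbitMQ/Celery and Kafka systems are all hypothetical names. The only thing borrowed from Celery is the `.delay(...)` call style, so that callers would not need to change.

```python
import functools

# Hypothetical in-memory stand-ins for the two task systems.
celery_queue = []
kafka_queue = []

# Hypothetical runtime routing table: task name -> "celery" or "kafka".
# Flipping an entry migrates that task without touching its call sites.
ROUTES = {}


def task(func):
    """Stand-in for a wrapped @task decorator that adds dynamic routing."""

    @functools.wraps(func)
    def run(*args, **kwargs):
        # Calling the task directly still executes it synchronously.
        return func(*args, **kwargs)

    def delay(*args, **kwargs):
        # Decide the destination at submission time, not at import time.
        backend = ROUTES.get(func.__name__, "celery")
        target = kafka_queue if backend == "kafka" else celery_queue
        target.append((func.__name__, args, kwargs))
        return backend

    run.delay = delay
    return run


@task
def send_receipt(order_id):
    return f"receipt sent for order {order_id}"
```

    With this shape, `send_receipt.delay(1)` lands in the Celery stand-in by default; setting `ROUTES["send_receipt"] = "kafka"` sends later submissions to the Kafka stand-in. A real system would read the routing table from a dynamic configuration source rather than a module-level dict.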
  • 21. No solution is perfect: iterate as needed
  • 22. NO SOLUTION IS PERFECT: Head-of-the-line blocking
    A “slow” message in a partition can block all messages behind it from getting processed.
  • 23. NO SOLUTION IS PERFECT: Non-blocking task consumer
    Consists of
    ● 1 x Local message queue
    ● 1 x Kafka-consumer process
    ● N x Task-executor processes
    A “slow” message only blocks a single task-executor process until it completes. Other messages in the partition can continue to flow.
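    The consumer layout on this slide can be sketched as follows. This is a simplified model under stated assumptions, not the production design: a plain Python list stands in for a Kafka partition, threads stand in for the Kafka-consumer and task-executor processes, and offset commits are not modeled. The names `run_consumer` and `handle` are hypothetical.

```python
import queue
import threading
import time


def run_consumer(partition, num_executors, handler):
    """Feed messages through a local queue to N executors.

    Returns the messages in the order they finished processing.
    """
    # Bounded local queue: the consumer loop waits when executors fall behind.
    local_queue = queue.Queue(maxsize=num_executors)
    completed = []
    lock = threading.Lock()
    stop = object()  # sentinel telling an executor to exit

    def executor():
        while True:
            msg = local_queue.get()
            if msg is stop:
                break
            handler(msg)
            with lock:
                completed.append(msg)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for w in workers:
        w.start()

    # Consumer loop (stand-in for the single Kafka-consumer process).
    for msg in partition:
        local_queue.put(msg)
    for _ in workers:
        local_queue.put(stop)
    for w in workers:
        w.join()
    return completed


def handle(msg):
    # Hypothetical handler: one message is much slower than the rest.
    time.sleep(0.3 if msg == "slow" else 0.01)


# With two executors, the slow message is expected to finish last
# instead of holding up the messages queued behind it.
order = run_consumer(["slow", "a", "b", "c"], num_executors=2, handler=handle)
```

    Running the same input with `num_executors=1` reproduces head-of-the-line blocking: the slow message finishes first and everything behind it waits.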
  • 24. Scheduled tasks (and more) via Cadence
    ● Kafka is not a hard dependency for Cadence
    ● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
    ● Distributed, scalable, durable, and highly available
    ● Orchestrates asynchronous business logic scalably and with resilience
  • 26. Conclusion & Key Wins
    ● NO MORE REPEATED OUTAGES: Dealt with the outage problem within 3 weeks of development, giving us more time after that to focus on esoteric features.
    ● PROCESSING NO LONGER A BOTTLENECK: Task processing was no longer a bottleneck, allowing DoorDash to continue growing and serving customers.
    ● 10x INCREASED OBSERVABILITY: Granular observability in prod and dev environments, improving confidence as well as developer productivity.
    ● OPERATIONAL DECENTRALIZATION: Enable developers to debug their operational issues, and perform cluster-management ops if needed.
  • 27. Other notable use-cases of Kafka at DoorDash
  • 28. OTHER USE-CASES: Real-Time Streaming Platform
    Receive real-time production and analytics events
    Components: Kafka REST Proxy, Apache Flink
    Current Scale
    ● 800B events / day
    ● Peak > 200k / sec
  • 29. OTHER USE-CASES: Our Iguazu Pipeline
    Standardized events with schema definitions as Protobuf or Avro
    ● Low latency
    ● Lower costs
    ● Better data quality
  • 30. OTHER USE-CASES: Search Indexing
    Huge boost in
    ● Indexing speed
    ● Accuracy
  • 31. It takes a village!
    Engineering Branding: Ezra Berger, Wayne Cunningham
    Engineering: Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger, Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir
  • 33. Further Reading
    ● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
    ● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
    ● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/