Have you ever migrated Kafka clusters from one data center to another while remaining completely transparent to client applications?
At PayPal, as part of a massive datacenter migration initiative, the Kafka team successfully moved all PayPal Kafka traffic across data centers. This initiative involved migrating 20+ Kafka clusters (1000+ broker and zookeeper nodes), as well as 60+ mirrormaker groups, which together handle Kafka traffic volumes as high as 1 trillion messages per day. Throughout the migration, applications required no modification and encountered no service outage, no message loss and no duplicated messages. The whole migration process was fully transparent to Kafka applications.
In this session, you will learn the strategies, techniques and tools the PayPal Kafka team used to manage the migration process. You will also learn the lessons and pitfalls they experienced during this exercise, as well as the secret sauce that made the migration successful.
5. Kafka @PayPal
Kafka tech stacks at PayPal
Languages
Gimel
Application Frameworks
Multi-Tenant
Multiple Regions & Availability Zones
6. Kafka @PayPal
Kafka use cases at PayPal
Monitoring Metrics Streaming & Aggregation
Log Aggregation
Batch Processing
User Activity Tracking
Risk & Compliance
Use Cases
7. PayPal Data Center Migration
Business objective to migrate
• Migrate from eBay datacenter to dedicated PayPal datacenter
• Consolidate and simplify PayPal’s North American data center
• Scale up data center computing power and network bandwidth
• Increase flexibility, scalability and reduce data center cost
• Migrate datacenter without interrupting PayPal business
• Fully transparent to PayPal customers
• Exit eBay datacenter within a hard deadline
8. PayPal Kafka Platform Migration Scope
Kafka footprint and migration scope
[Diagram: Kafka apps and Mirror Maker groups across the Primary Data Center, New Primary Data Center, Secondary Data Center and Analytics Data Center]
Kafka Migration
• 20+ Kafka clusters
• 20+ Zookeeper clusters
• 2500+ Kafka Topics
• 60+ Mirror Maker Groups
9. PayPal Kafka Platform Migration Approach
The steps we follow for this migration
Strategy Planning Execution Verification
• Define requirement
• Budget allocation
• Design process
• Setup timeline
• Plan for each component (brokers, zookeepers, mirrormakers) migration
• Plan for each cluster and app group migration
• Team allocation
• Phased execution
• Risk analysis and control
• Customer notification and cluster monitoring
• Detailed execution steps and rollback plan
• Refine execution steps from previous migration experiences
• Cluster-level verification
• Application-level verification
10. PayPal Kafka Platform Migration Strategy
Migration requirement, budget, process and timeline
• Migration requirement
  • No business downtime during the migration
  • No data loss or message duplication introduced during the migration
  • Migration is managed by the Kafka Infra team and transparent to our customers
• Hardware capacity and specs within budget
  • Sufficient Kafka cluster capacity in the new datacenter
  • Carefully chosen hardware configuration for optimal performance
• Well-defined migration process
  • Brokers, zookeepers and mirror makers migration
• Well-planned migration timeline for each phase
  • Meet the PayPal datacenter migration deadline
11. PayPal Kafka Platform Migration Key Building Blocks
Topic lookup service enabled transparent migration of 2500+ Kafka topics
[Diagram: Kafka Publisher and Kafka Consumer (each using the PayPal Kafka client), Topic Lookup Service, Kafka Metadata DB]
1. Request bootstrap servers using topic name, colo, security zone
2. Bootstrap server lookup
3. Return bootstrap servers list
4. Use bootstrap servers to connect to Kafka cluster
5. Publish messages (publisher) or consume messages (consumer)
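The lookup flow on this slide can be sketched roughly as follows. This is a minimal illustration, not PayPal's actual service: the in-memory `METADATA_DB` dictionary, the function names, the topic name and the host names are all hypothetical stand-ins.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the Kafka metadata DB:
# (topic, colo, security zone) -> broker addresses of the owning cluster.
METADATA_DB: Dict[Tuple[str, str, str], List[str]] = {
    ("payments.events", "old-dc", "zone-a"): ["old1:9092", "old2:9092", "old3:9092"],
}


def lookup_bootstrap_servers(topic: str, colo: str, zone: str) -> str:
    """Steps 1-3: resolve a topic to the bootstrap servers list of its cluster."""
    try:
        servers = METADATA_DB[(topic, colo, zone)]
    except KeyError:
        raise LookupError(f"no cluster registered for {topic!r} in {colo}/{zone}") from None
    return ",".join(servers)


def repoint_topic(topic: str, colo: str, zone: str, new_servers: List[str]) -> None:
    """Migration step: only the metadata changes; client code and config do not."""
    METADATA_DB[(topic, colo, zone)] = list(new_servers)


before = lookup_bootstrap_servers("payments.events", "old-dc", "zone-a")
repoint_topic("payments.events", "old-dc", "zone-a", ["new1:9092", "new2:9092", "new3:9092"])
after = lookup_bootstrap_servers("payments.events", "old-dc", "zone-a")
```

Because clients ask for bootstrap servers by topic name rather than hard-coding broker addresses, the same client call returns the new cluster once the metadata is updated.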
12. PayPal Kafka Platform Migration Key Building Blocks
Cross datacenter Kafka cluster and zookeeper quorum enabled seamless migration
[Diagram: a Kafka cluster with its producers, consumers, Zookeeper quorum and Mirror Makers, shown in the old DC, then spanning the old and new DCs, then in the new DC]
1. Flex up
2. Topic partition reassignment, applications move to new bootstrap servers, Zookeeper migration, Mirror maker migration
3. Flex down
15. Migration Challenges
Kafka cluster migration is hard
• Brokers are stateful
• Bootstrapping can be hard
Image source: https://jack-vanlightly.com/sketches/2018/10/2/kafka-vs-pulsar-rebalancing-sketch
16. Migration Challenges
Project-specific complexities
• Complex ecosystem
• Large number of topics and client applications
• Application ownership across multiple verticals
• Multi-tenant topics and clusters
• Dual-role client applications
17. Migration Challenges
Project-specific complexities
• Not a strict 1 to 1 mapping
• Multiple clusters mapping to a single cluster
• Application deployment pattern changes
[Diagram: Kafka Clusters 1, 2 and 3 in the old DC; Kafka Clusters A and B in the new DC]
18. Recap: migration challenges
• Kafka cluster migration is hard
  • Brokers are stateful
  • Bootstrapping can be hard
• Project-specific complexities
  • PayPal Kafka ecosystem is complex
  • Not a strict 1 to 1 mapping migration
• Goal and requirement
  • No data loss
  • No introduction of message duplicates
  • Minimum application disturbance
19. Cluster and data migration
Exploration: 1-way mirroring
1. MirrorMaker for T from old to new
2. Consumer migration
a. Shut down completely in old DC at time t1
b. Set consumer group offset on T in new DC to t1 minus a buffer
c. Start in new DC
3. Producer migration
[Diagram: publishers (Pub) and subscribers (Sub) of topic T in the old DC and new DC, with MirrorMaker (MM) copying T from old to new; labels 1 to 3 mark the steps above]
Pros:
• No app config change
• Moderate capacity and network overhead
Cons:
• Need to enforce migration order
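Step 2b rewinds the consumer group by time rather than by offset, since offsets in a mirrored topic generally do not line up with the source topic. A minimal sketch of that rewind, assuming the record timestamps of one mirrored partition are available as a sorted list (the `rewind_offset` helper and the sample data are hypothetical):

```python
from bisect import bisect_left
from typing import List


def rewind_offset(timestamps_ms: List[int], t1_ms: int, buffer_ms: int) -> int:
    """Return the first offset whose timestamp is >= t1 - buffer.

    timestamps_ms[i] is the timestamp of the record at offset i in the
    mirrored partition and is assumed to be non-decreasing.
    """
    target = t1_ms - buffer_ms
    return bisect_left(timestamps_ms, target)


# Records mirrored into the new DC, one per second starting at t=0.
mirrored = [0, 1000, 2000, 3000, 4000, 5000]
t1 = 4000  # the consumer stopped in the old DC after handling records up to t1
start = rewind_offset(mirrored, t1, buffer_ms=2000)

# Records between (t1 - buffer) and t1 were already processed in the old DC,
# so the restarted consumer sees them a second time.
replayed = [ts for ts in mirrored[start:] if ts <= t1]
```

The buffer protects against message loss, at the cost of replaying the records inside the buffer window. This is one way the mirroring approaches introduce duplicates.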
20. Cluster and data migration
Exploration: 2-way mirroring
1. Set up 2-way mirroring for T
a. T → new.T on old DC
b. T → old.T on new DC
2. Consume from T and *.T in both new and old
3. Consumer migration and/or producer migration
[Diagram: publishers (Pub) and subscribers (Sub) in the old DC and new DC; the old DC holds T and New.T, the new DC holds T and Old.T, with MirrorMaker (MM) between them; labels 1 to 3 mark the steps above]
Pros:
• Migration in any order or sequence
Cons:
• Application config change
• More capacity and network overhead
21. Cluster and data migration
Pros and cons of the mirroring approaches
• One-way mirroring
  • No application config change
  • Less additional capacity
  • Less network overhead
  • Need to enforce migration order
• Two-way mirroring
  • Migration in any order or sequence
  • Need application config change
  • More capacity and network overhead
• Problems with both mirroring approaches
  • Introduction of message duplicates
  • Increased latency
22. Cluster and data migration
Cluster mitosis
• Broker migration
[Diagram: brokers 1 to 6 in a single cluster, old brokers 1-3 and new brokers 4-6]
Expand
1. ls /brokers/ids
[1,2,3]
2. ls /brokers/ids
[1,2,3,4,5,6]
Move
1. Pair up old and new brokers
2. Current partition replica assignment
{"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[3,1,2]}]}
3. Optimized reassignment plan
{"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[6,4,5]}]}
4. Repartition
Shrink
1. ls /brokers/ids
[1,2,3,4,5,6]
2. ls /brokers/ids
[4,5,6]
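The "Move" step can be sketched as a simple rewrite of the current assignment using the old-to-new broker pairing. This is an illustrative sketch, not the team's actual tooling: `build_reassignment` and the `pairs` mapping are hypothetical names, and the pairing 1→4, 2→5, 3→6 is inferred from the example plan on the slide.

```python
import json
from typing import Dict


def build_reassignment(current: dict, broker_map: Dict[int, int]) -> dict:
    """Swap each old broker ID for its paired new broker, preserving replica order.

    Keeping the order keeps the preferred leader (first replica) in the same
    relative position, so leadership stays spread the way it was.
    """
    return {
        "version": current["version"],
        "partitions": [
            {
                "topic": p["topic"],
                "partition": p["partition"],
                # Brokers without a pair are left where they are.
                "replicas": [broker_map.get(b, b) for b in p["replicas"]],
            }
            for p in current["partitions"]
        ],
    }


current = {"version": 1, "partitions": [{"topic": "T", "partition": 0, "replicas": [3, 1, 2]}]}
pairs = {1: 4, 2: 5, 3: 6}  # old broker ID -> new broker ID (assumed pairing)
plan = build_reassignment(current, pairs)
print(json.dumps(plan))
```

The resulting JSON has the same shape as the plan on the slide, which is the format accepted by Kafka's partition reassignment tool.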
23. Cluster and data migration
Cluster mitosis
• Compared to mirroring
  • No data loss
  • No introduction of message duplicates
  • No impact to latency
  • Transparent to applications
24. Cluster and data migration
Cluster mitosis
• Zookeeper migration
0. A zookeeper cluster of 2N+1 nodes
1. Add 2N new nodes → 4N+1 quorum
[Update and rolling restart brokers to connect to only new zookeeper nodes]
2. Remove 2N old nodes → 2N+1 quorum
3. Replace the last old node
[Diagram: the zookeeper ensemble at steps 0 to 3, with one node labeled L at each step]
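The node counts in these steps keep the ensemble size odd throughout. A small sketch of the arithmetic, assuming the usual majority rule for a ZooKeeper quorum (the helper names are hypothetical):

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a majority of the ensemble."""
    return ensemble_size // 2 + 1


def migration_steps(n: int):
    """Yield (step, old_nodes, new_nodes, majority) for a 2N+1 ensemble."""
    old, new = 2 * n + 1, 0
    yield ("0. start", old, new, majority(old + new))
    new = 2 * n
    yield ("1. add 2N new nodes", old, new, majority(old + new))
    old = 1
    yield ("2. remove 2N old nodes", old, new, majority(old + new))
    old, new = 0, 2 * n + 1
    yield ("3. replace last old node", old, new, majority(old + new))


steps = list(migration_steps(n=1))
for name, old, new, quorum in steps:
    print(f"{name}: old={old} new={new} ensemble={old + new} majority={quorum}")
```

With N=1 the ensemble goes 3 → 5 → 3 → 3 nodes. After step 2 the 2N new nodes alone already meet the majority of N+1, so replacing the last old node does not cost the quorum.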
25. Client application migration
Bootstrapping to new nodes
1. Before all topic partitions are moved to new nodes:
a. Metadata
b. App in old DC: no change, no restart needed
c. App moved and started in new DC
2. Broker repartition to move all topic partitions to new nodes
3. After all topic partitions are moved to new nodes:
a. Metadata
b. Client app restart in both old and new DC
[Diagram: Metadata Store; brokers Old.1 to Old.3 in the old DC and New.1 to New.3 in the new DC; in each DC, a client app with the PayPal Kafka Library and a Kafka Topic Lookup Service]
26. Client application migration
Exploration: flipping without restarting
1. Preparation
a. Use IP addresses for inter-broker communication
b. Set leader_epoch on new cluster
c. Restart new cluster controller
2. Merging/Execution
a. Cleanly shut down all brokers in old cluster
b. Make CNAME change
c. Application connections flip
3. Clean up
a. Restart old cluster
b. Start MirrorMaker to drain unconsumed messages
c. Shut down old cluster completely
[Diagram: client apps and brokers Old.1, Old.2, Old.3; after the flip, New.1, New.2 and New.3 carry the CNAMEs Old.1, Old.2 and Old.3]
27. Tools, optimization, etc.
• Ansible scripts for host validation and deployment
  • Thoroughness
  • Standardize the process
  • Deployment automation
• Repartition optimization
  • Start migration of topics with low traffic first
  • 1 to 1 mapping between new and old brokers
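Ordering the repartition so low-traffic topics move first can be sketched as a sort followed by batching. This is only an illustration: the `migration_batches` helper, the use of bytes-in per second as the traffic measure, the batch size and the topic names are all assumptions.

```python
from typing import Dict, List


def migration_batches(bytes_in_per_sec: Dict[str, float], batch_size: int) -> List[List[str]]:
    """Order topics from lowest to highest traffic and split them into batches."""
    # Ties are broken by topic name so the ordering is deterministic.
    ordered = sorted(bytes_in_per_sec, key=lambda t: (bytes_in_per_sec[t], t))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


traffic = {"audit": 1_000.0, "clicks": 9_000_000.0, "logs": 250_000.0, "heartbeat": 50.0}
batches = migration_batches(traffic, batch_size=2)
```

Moving quiet topics first exercises the process on partitions where a mistake is cheap, before the heaviest topics are reassigned.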
28. Lessons Learnt
• Bootstrapping
  • Critical for Kafka migration
  • Can become complicated in a multi-tenant ecosystem
  • An independent topic lookup service with flexibility of metadata control helps a lot
• Metadata
  • Well-designed schema
  • Upfront and proactive tracking and collecting of client metadata
29. Lessons Learnt
• Metrics
  • Indispensable during migration
  • Most important metrics during migration
Categories       Sample Metrics
Volume Tracking  MessagesIn/OutPerSec; BytesIn/OutPerSec
Partitions       ISR Expands/Shrinks; AtMinIsrPartitions count/Single Replicas; UnderReplicatedPartitions/OfflinePartitions
System           CPU, Memory, FD count, iowait; Page swapping; Heap/thread dumps
Other            Active Controller Count; ZK Connection/Session Timeouts; Inter-broker, broker->zk network latencies
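A few of these metrics lend themselves to a simple go/no-go check between migration steps. The sketch below is an assumption about how such a gate might look, not the team's actual tooling: the `blocking_issues` helper, the dictionary keys and the thresholds are hypothetical, with values assumed to be cluster-wide totals collected elsewhere.

```python
from typing import Dict, List


def blocking_issues(metrics: Dict[str, int]) -> List[str]:
    """Return reasons not to proceed to the next migration step (empty means go)."""
    issues = []
    if metrics.get("OfflinePartitionsCount", 0) > 0:
        issues.append("offline partitions present")
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        issues.append("under-replicated partitions present")
    # A missing controller count is treated as unhealthy rather than assumed fine.
    if metrics.get("ActiveControllerCount", 0) != 1:
        issues.append("cluster does not have exactly one active controller")
    return issues


healthy = blocking_issues(
    {"OfflinePartitionsCount": 0, "UnderReplicatedPartitions": 0, "ActiveControllerCount": 1}
)
degraded = blocking_issues(
    {"OfflinePartitionsCount": 0, "UnderReplicatedPartitions": 12, "ActiveControllerCount": 1}
)
```

Here `healthy` is empty, while `degraded` lists the under-replication problem, so the next reassignment batch would wait until replicas catch up.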
30. Future
Future PayPal Kafka Platform
• More reliable and scalable
• More resilient
• Become cloud native
• Cloud and On-prem hybrid platform