How did We Move a Mountain?
Migrating 1 trillion+ messages per day
across data centers at PayPal
Agenda
PayPal
Kafka @PayPal
PayPal Kafka Platform Migration
Migration Challenges and Solution
Lessons learnt
Future PayPal Kafka Platform
PayPal
PayPal 2021 business metrics
Company Facts
• 403 million active consumer and merchant accounts
• > 1.1 trillion total payment volume in last 12 months
• 35,867 payment transactions per minute
• 27,700 employees
Growth Metrics (Q2 2020 -> Q2 2021)
• Revenue: 5.26B -> 6.4B (19% YoY *)
• Total Payment Volume: 222B -> 311B (40% YoY *)
• Active User Accounts: 346M -> 403M (40% YoY *)
• Transactions: 3.7B -> 4.74B (27% YoY *)
* https://www.fool.com/earnings/call-transcripts/2021/07/29/paypal-holdings-pypl-q2-2021-earnings-call-transcr/
Kafka @PayPal
Kafka footprint at PayPal
2021 Kafka Metrics
• 60+ Kafka Clusters
• 1000+ Kafka Brokers
• 1+ Trillion Messages/day
• 800+ Kafka Apps
• 5000+ Kafka Topics
Kafka Journey (2015 to 2019+): 0.8 -> 0.9 -> 0.10 -> 1.1 -> 2.2
Kafka @PayPal
Kafka tech stacks at PayPal
Languages
Gimel
Application Frameworks Multi-Tenant
Multiple Regions &
Availability Zones
Kafka @PayPal
Kafka use cases at PayPal
Monitoring Metrics Streaming & Aggregation
Log Aggregation
Batch Processing
User Activity Tracking
Risk & Compliance
Use Cases
PayPal Data Center Migration
• Migrate from eBay datacenter to dedicated PayPal datacenter
• Consolidate and simplify PayPal’s North American data center
• Scale up data center computing power and network bandwidth
• Increase flexibility and scalability, and reduce data center cost
• Migrate datacenter without interrupting PayPal business
• Fully transparent to PayPal customers
• Exit eBay datacenter within a hard deadline
Business objective to migrate
PayPal Kafka Platform Migration Scope
Kafka footprint and migration scope
[Diagram: Kafka apps and Mirror Maker across the Primary, New Primary, Secondary and Analytics data centers]
Kafka Migration
• 20+ Kafka clusters
• 20+ Zookeeper clusters
• 2500+ Kafka Topics
• 60+ Mirror Maker Groups
PayPal Kafka Platform Migration Approach
The steps we follow for this migration
Strategy -> Planning -> Execution -> Verification
• Define requirement
• Budget allocation
• Design process
• Set up timeline
• Plan for each component (brokers, zookeepers, mirrormakers) migration
• Plan for each cluster and app group migration
• Team allocation
• Phased execution
• Risk analysis and control
• Customer notification and cluster monitoring
• Detailed execution steps and rollback plan
• Refine execution steps from previous migration experiences
• Cluster-level verification
• Application-level verification
PayPal Kafka Platform Migration Strategy
Migration requirement, budget, process and timeline
• Migration requirement
  • No business downtime during the migration
  • No data loss or message duplication introduced by the migration
  • Migration is managed by the Kafka Infra team and transparent to our customers
• Hardware capacity and specs within budget
  • Sufficient Kafka cluster capacity in the new datacenter
  • Carefully chosen hardware configuration for optimal performance
• Well-defined migration process
  • Brokers, zookeepers and mirror makers migration
• Well-planned migration timeline for each phase
  • Meet the PayPal datacenter migration deadline
PayPal Kafka Platform Migration Key Building Blocks
Topic lookup service enabled transparent migration of 2500+ Kafka topics
[Diagram: Kafka Publisher and Kafka Consumer (each using the PayPal Kafka client), Topic Lookup Service, Kafka Metadata DB, Kafka cluster]
1. Publisher or consumer requests bootstrap servers using topic name, colo, security zone
2. Bootstrap server lookup (Topic Lookup Service against the Kafka Metadata DB)
3. Return bootstrap servers list
4. Use bootstrap servers to connect to Kafka cluster
5. Publish messages / Consume messages
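The lookup flow on this slide can be sketched in a few lines. This is only an in-memory illustration, not PayPal's implementation: the `TopicLookupService` class, its `repoint` method, the dictionary standing in for the Kafka Metadata DB, and all topic and host names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the Kafka Metadata DB:
# (topic, colo, security zone) -> bootstrap servers.
LookupKey = Tuple[str, str, str]


class TopicLookupService:
    def __init__(self, metadata: Dict[LookupKey, List[str]]) -> None:
        self._metadata = dict(metadata)

    def bootstrap_servers(self, topic: str, colo: str, zone: str) -> List[str]:
        """Steps 1-3: resolve a topic to the bootstrap servers a client should use."""
        try:
            return list(self._metadata[(topic, colo, zone)])
        except KeyError:
            raise LookupError(f"no cluster registered for {topic!r} in {colo}/{zone}")

    def repoint(self, topic: str, colo: str, zone: str, servers: List[str]) -> None:
        """Migration step: update metadata so later lookups return the new brokers."""
        self._metadata[(topic, colo, zone)] = list(servers)


service = TopicLookupService({
    ("payments.events", "old-dc", "zone-a"): ["old-1:9092", "old-2:9092"],
})

before = service.bootstrap_servers("payments.events", "old-dc", "zone-a")
service.repoint("payments.events", "old-dc", "zone-a", ["new-1:9092", "new-2:9092"])
after = service.bootstrap_servers("payments.events", "old-dc", "zone-a")
```

Because clients only ever ask for a topic by name, changing one metadata entry is enough to steer them to different brokers the next time they look the topic up.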
PayPal Kafka Platform Migration Key Building Blocks
Cross datacenter Kafka cluster and zookeeper quorum enabled seamless migration
[Diagram: Kafka cluster, Zookeeper quorum, Mirror Makers, producers and consumers in the Old DC and New DC, shown at each stage]
1. Flex up
2. Topic partition reassignment; applications move to new bootstrap servers; Zookeeper migration; Mirror maker migration
3. Flex down
PayPal Kafka Platform Migration Key Building Blocks
Centralized Kafka monitoring system helped track migration progress
[Diagram: Kafka brokers, Kafka producers and Kafka consumers send monitoring metrics to SignalFX; monitoring dashboard view]
Migration Challenges and
Path to Success
Migration Challenges
• Brokers are stateful
• Bootstrapping can be hard
Kafka cluster migration is hard
Image source: https://jack-vanlightly.com/sketches/2018/10/2/kafka-vs-pulsar-rebalancing-sketch
Migration Challenges
• Complex ecosystem
  • Large number of topics and client applications
  • Application ownership across multiple verticals
  • Multi-tenant topics and clusters
  • Dual-role client applications
Project Specific complexities
Migration Challenges
• Not a strict 1 to 1 mapping
  • Multiple clusters mapping to a single cluster
  • Application deployment pattern changes
Project Specific complexities
[Diagram: Kafka Clusters 1, 2 and 3 in the Old DC; Kafka Clusters A and B in the New DC]
Recap: migration challenges
• Kafka cluster migration is hard
  • Brokers are stateful
  • Bootstrapping can be hard
• Project specific complexities
  • PayPal Kafka ecosystem is complex
  • Not a strict 1 to 1 mapping migration
• Goal and requirement
  • No data loss
  • No introduction of message duplicates
  • Minimum application disturbance
Cluster and data migration
Exploration: 1-way mirroring
1. MirrorMaker for T from old to new
2. Consumer migration
a. Shut down completely in old DC @t1
b. Set consumer group offset on T in new DC to @t1-buffer
c. Start in new DC
3. Producer migration
[Diagram: Pub and Sub on topic T in the Old DC and in the New DC, with MM between the two; labels 1 to 3 mark the steps above]
Pros:
• No app config change
• Moderate capacity and network overhead
Cons:
• Need to enforce migration order
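Step 2b can be illustrated with a small offset lookup by timestamp. The sketch below is an assumption about how the rewind could be computed, not the presenters' tooling: it works on an in-memory list of message timestamps and follows the usual "earliest offset whose timestamp is at or after the target" rule. The function names and sample numbers are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def offset_for_time(timestamps_ms: List[int], target_ms: int) -> Optional[int]:
    """Return the earliest offset whose timestamp is >= target_ms, or None.

    timestamps_ms[i] is the timestamp of the message at offset i and is
    assumed to be non-decreasing.
    """
    idx = bisect_left(timestamps_ms, target_ms)
    return idx if idx < len(timestamps_ms) else None


def restart_offset(timestamps_ms: List[int], t1_ms: int, buffer_ms: int) -> Optional[int]:
    """Offset at which the consumer group restarts in the new DC (@t1 - buffer)."""
    return offset_for_time(timestamps_ms, t1_ms - buffer_ms)


# Hypothetical mirrored partition in the new DC: one message per second.
timestamps = [1_000 * i for i in range(10)]   # offsets 0..9
t1 = 7_000                                    # consumers stopped in old DC at t1
start = restart_offset(timestamps, t1, buffer_ms=2_000)
# Messages at or after the restart offset that were produced before t1.
replayed = [o for o, ts in enumerate(timestamps) if start <= o and ts < t1]
```

A timestamp is used rather than the old offset because offsets on a mirrored topic generally do not match the source cluster. The messages in `replayed` fall inside the buffer window, so the consumer may see them a second time.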
Cluster and data migration
Exploration: 2-way mirroring
1. Set up 2-way mirroring for T
a. T -> new.T on old DC
b. T -> old.T on new DC
2. Consume from T and *.T in both new and old
3. Consumer migration and/or producer migration
[Diagram: Pub and Sub in the Old DC (topics T and New.T) and in the New DC (topics T and Old.T), with MM between the two; labels 1 to 3 mark the steps above]
Pros:
• Migration in any order or sequence
Cons:
• Application config change
• More capacity and network overhead
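Step 2 (consume from T and *.T) amounts to a pattern subscription. The sketch below only shows how such a pattern could select topics from a list; the helper names and the topic list are hypothetical, and it assumes mirror prefixes are a single dot-free label such as `old` or `new`.

```python
import re
from typing import Iterable, List


def mirrored_topic_pattern(topic: str) -> "re.Pattern[str]":
    """Match `topic` itself plus a single-prefix mirror such as old.<topic> or new.<topic>."""
    return re.compile(r"^(?:[^.]+\.)?" + re.escape(topic) + r"$")


def topics_to_consume(topic: str, available: Iterable[str]) -> List[str]:
    """Topics a consumer of `topic` would read under 2-way mirroring."""
    pattern = mirrored_topic_pattern(topic)
    return sorted(t for t in available if pattern.match(t))


# Hypothetical topic list on the new DC cluster.
available = ["T", "old.T", "T2", "old.T2", "other"]
selected = topics_to_consume("T", available)
```

This is the application config change listed under the cons: every consumer has to switch from a fixed topic name to a pattern like this one.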
Cluster and data migration
Pros and cons of the mirroring approaches
• One-way mirroring
  • No application config change
  • Less additional capacity
  • Less network overhead
  • Need to enforce migration order
• Two-way mirroring
  • Migration in any order or sequence
  • Need application config change
  • More capacity and network overhead
• Problems with both mirroring approaches
  • Introduction of message duplicates
  • Increased latency
Cluster mitosis
• Broker migration
[Diagram: brokers 1 to 6]
Expand
1. ls /brokers/ids
[1,2,3]
2. ls /brokers/ids
[1,2,3,4,5,6]
Move
1. Pair up old and new brokers
2. Current partition replica assignment
{"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[3,1,2]}]}
3. Optimized reassignment plan
{"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[6,4,5]}]}
4. Repartition
Shrink
1. ls /brokers/ids
[1,2,3,4,5,6]
2. ls /brokers/ids
[4,5,6]
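The "pair up old and new brokers" step can be sketched as a plain mapping over the replica lists. The pairing below (1 -> 4, 2 -> 5, 3 -> 6) is an assumption chosen to be consistent with the example assignment on the slide, and `paired_reassignment` is a hypothetical helper, not part of Kafka.

```python
import json
from typing import Dict


def paired_reassignment(current: dict, pairing: Dict[int, int]) -> dict:
    """Replace every old broker id in each replica list with its paired new broker id."""
    return {
        "version": current["version"],
        "partitions": [
            {
                "topic": p["topic"],
                "partition": p["partition"],
                # Keep replica order so the preferred leader maps to its paired broker.
                "replicas": [pairing[b] for b in p["replicas"]],
            }
            for p in current["partitions"]
        ],
    }


current = json.loads(
    '{"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[3,1,2]}]}'
)
pairing = {1: 4, 2: 5, 3: 6}   # old broker id -> new broker id (assumed pairing)
plan = paired_reassignment(current, pairing)
plan_json = json.dumps(plan, separators=(",", ":"))
```

Here `plan_json` has the same shape as the reassignment plan shown in step 3. With a fixed pairing, each old broker's data moves to exactly one new broker.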
Cluster and data migration
• Compared to mirroring
  • No data loss
  • No introduction of message duplicates
  • No impact to latency
  • Transparent to applications
Cluster mitosis
Cluster and data migration
Cluster mitosis
• Zookeeper migration
0. A zookeeper cluster of 2N+1 nodes
1. Add 2N new nodes -> 4N+1 quorum
[Update and rolling restart brokers to connect to only new zookeeper nodes]
2. Remove 2N old nodes -> 2N+1 quorum
3. Replace the last old node
[Diagram: Zookeeper nodes at steps 0 to 3, with one node marked "L" at each step]
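The node counts in steps 0 to 3 can be checked with a short simulation. This is only arithmetic on ensemble membership under assumed node names (`old-1`, `new-1`, and so on). It does not talk to ZooKeeper, and which old node is kept until last is an arbitrary choice here.

```python
from typing import List


def majority(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a majority of the ensemble."""
    return ensemble_size // 2 + 1


def migration_steps(n: int) -> List[List[str]]:
    """Ensemble membership after steps 0-3 for an initial ensemble of 2N+1 old nodes."""
    old = [f"old-{i}" for i in range(1, 2 * n + 2)]   # 2N+1 old nodes
    new = [f"new-{i}" for i in range(1, 2 * n + 2)]   # 2N+1 new nodes in total
    step0 = old
    step1 = old + new[: 2 * n]          # add 2N new nodes -> 4N+1
    step2 = old[-1:] + new[: 2 * n]     # remove 2N old nodes -> 2N+1
    step3 = new                         # replace the last old node
    return [step0, step1, step2, step3]


steps = migration_steps(n=1)
sizes = [len(s) for s in steps]
quorums = [majority(len(s)) for s in steps]
```

The ensemble size stays odd at every step. After step 2 the 2N new nodes alone already exceed the N+1 majority, so replacing the last old node does not put the quorum at risk.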
Cluster and data migration
Bootstrapping to new nodes
[Diagram: client apps in the Old DC and New DC, each using the PayPal Kafka Library and the Kafka Topic Lookup Service backed by the Metadata Store; brokers Old.1, Old.2, Old.3 and New.1, New.2, New.3]
1. Before all topic partitions are moved to new nodes:
a. Metadata
b. App in old DC: no change, no restart needed
c. App moved and started in new DC
2. Broker repartition to move all topic partitions to new nodes
3. After all topic partitions are moved to new nodes:
a. Metadata
b. Client app restart in both old and new DC
Client application migration
Exploration: flipping without restarting
1. Preparation
a. Use IP addresses for inter-broker communication
b. Set leader_epoch on new cluster
c. Restart new cluster controller
2. Merging/Execution
a. Cleanly shut down all brokers in old cluster
b. Make CNAME change
c. Application connections flip
3. Clean up
a. Restart old cluster
b. Start MirrorMaker to drain unconsumed messages
c. Shut down old cluster completely
[Diagram: client apps; old brokers Old.1, Old.2, Old.3; new brokers New.1 with CNAME: Old.1, New.2 with CNAME: Old.2, New.3 with CNAME: Old.3]
Tools, optimization, etc.
• Ansible scripts for host validation and deployment
  • Thoroughness
  • Standardize the process
  • Deployment automation
• Repartition optimization
  • Start migration of topics with low traffic first
  • 1 to 1 mapping between new and old brokers
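The "low traffic first" ordering can be sketched as a sort over per-topic traffic. The traffic figures, topic names and the batching helper are hypothetical; the slides do not say how topics were grouped, so batching is an assumption added for illustration.

```python
from typing import Dict, List


def migration_order(bytes_in_per_sec: Dict[str, float]) -> List[str]:
    """Topics sorted so the lowest-traffic topics are reassigned first."""
    return sorted(bytes_in_per_sec, key=lambda t: (bytes_in_per_sec[t], t))


def batches(topics: List[str], size: int) -> List[List[str]]:
    """Split the ordered topics into reassignment batches of at most `size` topics."""
    return [topics[i:i + size] for i in range(0, len(topics), size)]


# Hypothetical per-topic inbound traffic in bytes per second.
traffic = {"audit": 5e3, "clicks": 9e6, "metrics": 2e5, "logs": 4e7}
order = migration_order(traffic)
plan = batches(order, size=2)
```

Starting with quiet topics lets the process be exercised where a mistake moves the least data.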
Lessons Learnt
• Bootstrapping
  • Critical for Kafka migration
  • Can become complicated in a multi-tenant ecosystem
  • An independent topic lookup service with flexibility of metadata control helps a lot
• Metadata
  • Well-designed schema
  • Upfront and proactive tracking and collecting of client metadata
Lessons Learnt
• Metrics
• Indispensable during migration
• Most important metrics during migration
Categories | Sample Metrics
Volume Tracking | MessagesIn/OutPerSec; BytesIn/OutPerSec
Partitions | ISR Expands/Shrinks; AtMinIsrPartitions count/Single Replicas; UnderReplicatedPartitions/OfflinePartitions
System | CPU, Memory, FD count, iowait; Page swapping; Heap/thread dumps
Other | Active Controller Count; ZK Connection/Session Timeouts; Inter-broker, broker->zk network latencies
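A few of these metrics lend themselves to a simple go/no-go check between migration steps. The sketch below is an assumption about how such a gate could look. The dictionary of cluster-wide values, the function name and the zero thresholds are hypothetical, and in practice the values would come from the monitoring system.

```python
from typing import Dict, List


def migration_blockers(metrics: Dict[str, float]) -> List[str]:
    """Return reasons to pause the migration; an empty list means no blocker was found."""
    blockers = []
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        blockers.append("under-replicated partitions present")
    if metrics.get("OfflinePartitions", 0) > 0:
        blockers.append("offline partitions present")
    if metrics.get("ActiveControllerCount", 0) != 1:
        blockers.append("cluster does not have exactly one active controller")
    return blockers


# Hypothetical cluster-wide snapshots taken between reassignment steps.
healthy = {"UnderReplicatedPartitions": 0, "OfflinePartitions": 0, "ActiveControllerCount": 1}
degraded = {"UnderReplicatedPartitions": 12, "OfflinePartitions": 0, "ActiveControllerCount": 1}
```

Calling `migration_blockers(healthy)` gives an empty list, while `migration_blockers(degraded)` reports the under-replicated partitions. That is the signal to wait before starting the next step.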
Future
Future PayPal Kafka Platform
• More reliable and scalable
• More resilient
• Become cloud native
• Cloud and On-prem hybrid platform
Thank You!

How did we move the mountain? - Migrating 1 trillion+ messages per day across data centers at PayPal | Lei Huang and Na Yang, PayPal

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

How did we move the mountain? - Migrating 1 trillion+ messages per day across data centers at PayPal | Lei Huang and Na Yang, PayPal

  • 1. How did We Move a Mountain? Migrating 1 trillion+ messages per day across data centers at PayPal
  • 2. Agenda
      • PayPal
      • Kafka @PayPal
      • PayPal Kafka Platform Migration
      • Migration Challenges and Solution
      • Lessons learnt
      • Future PayPal Kafka Platform
  • 3. PayPal: PayPal 2021 business metrics
      Company Facts
      • 403 million active consumer and merchant accounts
      • > 1.1 trillion total payment volume in last 12 months
      • 35,867 payment transactions per minute
      • 27,700 employees
      Growth Metrics (Q2 2020 to Q2 2021) *
      • Revenue: 5.26B to 6.4B, 19% YoY
      • Total Payment Volume: 222B to 311B, 40% YoY
      • Active User Accounts: 346M to 403M, 40% YoY
      • Transactions: 3.7B to 4.74B, 27% YoY
      * https://www.fool.com/earnings/call-transcripts/2021/07/29/paypal-holdings-pypl-q2-2021-earnings-call-transcr/
  • 4. Kafka @PayPal: Kafka footprint at PayPal
      2021 Kafka Metrics
      • 60+ Kafka clusters
      • 1000+ Kafka brokers
      • 1+ trillion messages/day
      • 800+ Kafka apps
      • 5000+ Kafka topics
      Kafka Journey (2015 to 2019+): versions 0.8, 0.9, 0.10, 1.1, 2.2
  • 5. Kafka @PayPal: Kafka tech stacks at PayPal
      • Languages
      • Gimel
      • Application Frameworks
      • Multi-Tenant
      • Multiple Regions & Availability Zones
  • 6. Kafka @PayPal: Kafka use cases at PayPal
      • Monitoring Metrics Streaming & Aggregation
      • Log Aggregation
      • Batch Processing
      • User Activity Tracking
      • Risk & Compliance
  • 7. PayPal Data Center Migration: Business objective to migrate
      • Migrate from eBay datacenter to dedicated PayPal datacenter
      • Consolidate and simplify PayPal’s North American data center
      • Scale up data center computing power and network bandwidth
      • Increase flexibility, scalability and reduce data center cost
      • Migrate datacenter without interrupting PayPal business
      • Fully transparent to PayPal customers
      • Exit eBay datacenter within a hard deadline
  • 8. PayPal Kafka Platform Migration Scope: Kafka footprint and migration scope
      Diagram: Kafka apps in the Primary, New Primary, Secondary and Analytics Data Centers, linked by Mirror Maker.
      Kafka Migration
      • 20+ Kafka clusters
      • 20+ Zookeeper clusters
      • 2500+ Kafka Topics
      • 60+ Mirror Maker Groups
  • 9. PayPal Kafka Platform Migration Approach: The steps we follow for this migration
      Phases: Strategy, Planning, Execution, Verification
      • Define requirement
      • Budget allocation
      • Design process
      • Set up timeline
      • Plan for each component (brokers, zookeepers, mirrormakers) migration
      • Plan for each cluster and app group migration
      • Team allocation
      • Phased execution
      • Risk analysis and control
      • Customer notification and cluster monitoring
      • Detailed execution steps and rollback plan
      • Refine execution steps from previous migration experiences
      • Cluster-level verification
      • Application-level verification
  • 10. PayPal Kafka Platform Migration Strategy: Migration requirement, budget, process and timeline
      • Migration requirement
          • No business downtime during the migration
          • No data loss or message duplication introduced during the migration
          • Migration is managed by the Kafka Infra team and transparent to our customers
      • Hardware capacity and specs within budget
          • Sufficient Kafka cluster capacity in the new datacenter
          • Carefully chosen hardware configuration for optimal performance
      • Well-defined migration process
          • Brokers, zookeepers and mirror makers migration
      • Well-planned migration timeline for each phase
          • Meet the PayPal datacenter migration deadline
  • 11. PayPal Kafka Platform Migration Key Building Blocks: Topic lookup service enabled transparent migration of 2500+ Kafka topics
      Components: Kafka Publisher and Kafka Consumer (each using the PayPal Kafka client), Topic Lookup Service, Kafka Metadata DB
      1. Client requests bootstrap servers using topic name, colo, security zone
      2. Topic Lookup Service performs the bootstrap server lookup in the Kafka Metadata DB
      3. Topic Lookup Service returns the bootstrap servers list
      4. Client uses the bootstrap servers to connect to the Kafka cluster
      5. Publisher publishes messages; consumer consumes messages
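The lookup flow on this slide can be sketched in a few lines of Python. This is a minimal illustration, not PayPal's implementation: the class `TopicLookupService`, the helper `connect_config`, the topic name, and the host names are all hypothetical, and the metadata DB is replaced by an in-memory dictionary. It shows why the indirection matters for migration: operators change one metadata entry, and clients pick up the new brokers the next time they resolve the topic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical lookup key: the slide names topic, colo and security zone as inputs.
LookupKey = Tuple[str, str, str]


@dataclass
class TopicLookupService:
    """In-memory stand-in for the lookup service and its metadata DB."""
    metadata: Dict[LookupKey, List[str]]

    def bootstrap_servers(self, topic: str, colo: str, zone: str) -> List[str]:
        key = (topic, colo, zone)
        if key not in self.metadata:
            raise KeyError(f"no cluster registered for {key}")
        return list(self.metadata[key])


def connect_config(service: TopicLookupService, topic: str, colo: str, zone: str) -> dict:
    """Resolve the topic and build the client property a Kafka client would use."""
    servers = service.bootstrap_servers(topic, colo, zone)
    return {"bootstrap.servers": ",".join(servers)}


key = ("payments.events", "colo-1", "zone-a")  # made-up example values
service = TopicLookupService({key: ["old1:9092", "old2:9092"]})
before = connect_config(service, *key)

# Operators repoint the topic in metadata; application config is untouched.
service.metadata[key] = ["new1:9092", "new2:9092"]
after = connect_config(service, *key)

print(before)
print(after)
```

Because the application only ever names the topic, the cluster behind it can change without an application config change.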
  • 12. PayPal Kafka Platform Migration Key Building Blocks: Cross datacenter Kafka cluster and zookeeper quorum enabled seamless migration
      Diagram (stages 1 to 3): a Kafka cluster, Zookeeper quorum, Mirror Makers, producers and consumers spanning the Old DC and the New DC, with "Flex up" and "Flex down" markers.
      Labelled steps:
      • Topic partition reassignment
      • Applications move to new bootstrap servers
      • Zookeeper migration
      • Mirror maker migration
  • 13. PayPal Kafka Platform Migration Key Building Blocks: Centralized Kafka monitoring system helped track migration progress
      Diagram: Kafka Brokers, Kafka Producers and Kafka Consumers of each Kafka Cluster send monitoring metrics to SignalFX, which feeds the Monitoring Dashboard View.
  • 15. Migration Challenges: Kafka cluster migration is hard
      • Brokers are stateful
      • Bootstrapping can be hard
      Image source: https://jack-vanlightly.com/sketches/2018/10/2/kafka-vs-pulsar-rebalancing-sketch
  • 16. Migration Challenges: Project-specific complexities
      • Complex ecosystem
          • Large number of topics and client applications
          • Application ownership across multiple verticals
          • Multi-tenant topics and clusters
          • Dual-role client applications
  • 17. Migration Challenges: Project-specific complexities
      • Not a strict 1 to 1 mapping
          • Multiple clusters mapping to a single cluster
          • Application deployment pattern changes
      Diagram: Kafka Clusters 1, 2 and 3 in the Old DC map to Kafka Clusters A and B in the New DC.
  • 18. Recap: migration challenges
      • Kafka cluster migration is hard
          • Brokers are stateful
          • Bootstrapping can be hard
      • Project-specific complexities
          • PayPal Kafka ecosystem is complex
          • Not a strict 1 to 1 mapping migration
      • Goal and requirement
          • No data loss
          • No introduction of message duplicates
          • Minimum application disturbance
  • 19. Cluster and data migration. Exploration: 1-way mirroring
      1. MirrorMaker for T from old to new
      2. Consumer migration
         a. Shut down completely in old DC @t1
         b. Set consumer group offset on T in new DC to @t1 - buffer
         c. Start in new DC
      3. Producer migration
      Diagram: Pub and Sub on topic T in the Old DC and in the New DC, with MirrorMaker (MM) between them; steps 1 to 3 are marked.
      Pros:
      • No app config change
      • Moderate capacity and network overhead
      Cons:
      • Need to enforce migration order
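Step 2b can be illustrated with a small, self-contained Python sketch. It assumes the mirrored partition is represented as a plain list where index is the offset and the value is the record timestamp in milliseconds; the function names and sample numbers are hypothetical. Real clients expose a comparable time-to-offset lookup (for example `offsetsForTimes` on the Java consumer), so this only shows the arithmetic behind "@t1 - buffer".

```python
from bisect import bisect_left
from typing import List, Optional


def offset_for_time(timestamps_ms: List[int], target_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= target_ms, or None if there is none.

    timestamps_ms[i] is the timestamp of offset i and is assumed non-decreasing.
    """
    idx = bisect_left(timestamps_ms, target_ms)
    return idx if idx < len(timestamps_ms) else None


def restart_offset(timestamps_ms: List[int], t1_ms: int, buffer_ms: int) -> Optional[int]:
    """Offset to start from in the new DC: rewind from the shutdown time t1 by a buffer."""
    return offset_for_time(timestamps_ms, t1_ms - buffer_ms)


# Mirrored copy of T in the new DC (made-up timestamps, one per offset).
mirrored = [1000, 2000, 3000, 4000, 5000, 6000]

# Consumers stopped in the old DC at t1 = 5000 ms; rewind by a 1500 ms buffer.
start = restart_offset(mirrored, t1_ms=5000, buffer_ms=1500)
print(start)
```

Records between `t1 - buffer` and `t1` are consumed a second time in the new DC, which is the trade-off the buffer makes to avoid skipping messages.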
  • 20. Cluster and data migration. Exploration: 2-way mirroring
      1. Set up 2-way mirroring for T
         a. T → new.T on old DC
         b. T → old.T on new DC
      2. Consume from T and *.T in both new and old
      3. Consumer migration and/or producer migration
      Diagram: Pub and Sub in both DCs; the Old DC holds T and New.T, the New DC holds T and Old.T, with MirrorMaker (MM) in both directions.
      Pros:
      • Migration in any order or sequence
      Cons:
      • Application config change
      • More capacity and network overhead
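The naming scheme in steps 1 and 2 can be sketched as follows. This is an assumption-laden illustration: the helper names are hypothetical, the prefix is taken to be the label of the DC the data came from (as in `old.T` and `new.T`), and the rule that only unprefixed topics are mirrored is an assumption added here to show how such a scheme can avoid mirroring loops; the slide does not state it.

```python
import re


def mirrored_name(source_dc: str, topic: str) -> str:
    """Name of the mirrored copy on the remote side, e.g. T from 'old' becomes 'old.T'."""
    return f"{source_dc}.{topic}"


def subscription_pattern(topic: str) -> "re.Pattern[str]":
    """Regex matching the local topic T and any single-prefix mirror *.T."""
    return re.compile(rf"^(?:[^.]+\.)?{re.escape(topic)}$")


def should_mirror(topic_name: str, base_topic: str) -> bool:
    """Assumed loop guard: mirror only the unprefixed local topic, never a mirrored copy."""
    return topic_name == base_topic


pattern = subscription_pattern("T")
candidates = ["T", "old.T", "new.T", "old.TX", "T2"]
subscribed = [name for name in candidates if pattern.match(name)]
print(subscribed)
```

A consumer subscribed this way sees the same stream whichever DC it runs in, which is what allows consumers and producers to move in any order, at the price of the application config change listed under cons.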
  • 21. Cluster and data migration: Pros and cons of the mirroring approaches
      • One-way mirroring
          • No application config change
          • Less additional capacity
          • Less network overhead
          • Need to enforce migration order
      • Two-way mirroring
          • Migration in any order or sequence
          • Need application config change
          • More capacity and network overhead
      • Problems with both mirroring approaches
          • Introduction of message duplicates
          • Increased latency
  • 22. Cluster and data migration. Cluster mitosis
      • Broker migration
      Expand
      1. ls /brokers/ids [1,2,3]
      2. ls /brokers/ids [1,2,3,4,5,6]
      Move
      1. Pair up old and new brokers
      2. Current partition replica assignment
         {"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[3,1,2]}]}
      3. Optimized reassignment plan
         {"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[6,4,5]}]}
      4. Repartition
      Shrink
      1. ls /brokers/ids [1,2,3,4,5,6]
      2. ls /brokers/ids [4,5,6]
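The "Move" phase can be sketched as a pure transformation of the assignment JSON. The function name `build_reassignment` and the pairing table are hypothetical; the input and the expected plan are the ones shown on the slide. The resulting JSON has the shape accepted by Kafka's `kafka-reassign-partitions.sh` tool as a reassignment file.

```python
import json
from typing import Dict


def build_reassignment(current: dict, pairing: Dict[int, int]) -> dict:
    """Replace each old broker id with its paired new broker, keeping replica order.

    Keeping the order keeps the preferred leader on the paired broker.
    Brokers without a pairing are left unchanged.
    """
    partitions = []
    for p in current["partitions"]:
        partitions.append({
            "topic": p["topic"],
            "partition": p["partition"],
            "replicas": [pairing.get(broker, broker) for broker in p["replicas"]],
        })
    return {"version": current["version"], "partitions": partitions}


current = json.loads(
    '{"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[3,1,2]}]}'
)
pairing = {1: 4, 2: 5, 3: 6}  # 1 to 1 mapping between old and new brokers

plan = build_reassignment(current, pairing)
plan_json = json.dumps(plan, separators=(",", ":"))
print(plan_json)
# {"version":1,"partitions":[{"topic":"T","partition":0,"replicas":[6,4,5]}]}
```

With a fixed pairing, each old broker streams its data to exactly one new broker, so the load of the move is spread evenly across the cluster.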
  • 23. Cluster and data migration. Cluster mitosis
      • Compared to mirroring
          • No data loss
          • No introduction of message duplicates
          • No impact on latency
          • Transparent to applications
  • 24. Cluster and data migration. Cluster mitosis
      • Zookeeper migration
      0. A zookeeper cluster of 2N+1 nodes
      1. Add 2N new nodes → 4N+1 quorum
         [Update and rolling restart brokers to connect to only new zookeeper nodes]
      2. Remove 2N old nodes → 2N+1 quorum
      3. Replace the last old node
      Diagram: the quorum at stages 0 to 3, with one node marked "L" at each stage.
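The quorum arithmetic behind these stages can be checked with a short Python sketch. The `Stage` type and `migration_stages` function are hypothetical names; the only rule assumed is the standard one that an ensemble needs a strict majority of its members to form a quorum.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    label: str
    old_nodes: int
    new_nodes: int

    @property
    def size(self) -> int:
        return self.old_nodes + self.new_nodes

    @property
    def majority(self) -> int:
        # Smallest number of nodes that is more than half of the ensemble.
        return self.size // 2 + 1


def migration_stages(n: int) -> List[Stage]:
    """Ensemble composition at stages 0 to 3 for an original ensemble of 2N+1 nodes."""
    return [
        Stage("0: original ensemble", 2 * n + 1, 0),
        Stage("1: add 2N new nodes", 2 * n + 1, 2 * n),
        Stage("2: remove 2N old nodes", 1, 2 * n),
        Stage("3: replace last old node", 0, 2 * n + 1),
    ]


stages = migration_stages(n=2)
for s in stages:
    print(f"{s.label}: size={s.size}, majority={s.majority}, "
          f"tolerated failures={s.size - s.majority}")
```

The ensemble size stays odd at every stage (2N+1, 4N+1, 2N+1, 2N+1). At stage 1 the 2N new nodes are one short of the 2N+1 majority, so at least one old node takes part in every quorum until the old nodes are removed.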
  • 25. Client application migration. Bootstrapping to new nodes
      1. Before all topic partitions are moved to new nodes:
         a. Metadata
         b. App in old DC: no change, no restart needed
         c. App moved and started in new DC
      2. Broker repartition to move all topic partitions to new nodes
      3. After all topic partitions are moved to new nodes:
         a. Metadata
         b. Client app restart in both old and new DC
      Diagram: in each DC a client app uses the PayPal Kafka Library to call the Kafka Topic Lookup Service, backed by the Metadata Store; brokers Old.1, Old.2, Old.3 sit in the Old DC and New.1, New.2, New.3 in the New DC.
  • 26. Client application migration. Exploration: flipping without restarting
      1. Preparation
         a. Use IP addresses for inter-broker communication
         b. Set leader_epoch on new cluster
         c. Restart new cluster controller
      2. Merging/Execution
         a. Cleanly shut down all brokers in old cluster
         b. Make CNAME change
         c. Application connections flip
      3. Clean up
         a. Restart old cluster
         b. Start MirrorMaker to drain unconsumed messages
         c. Shut down old cluster completely
      Diagram: client apps that connected to Old.1, Old.2 and Old.3 now reach New.1, New.2 and New.3, which carry the CNAMEs Old.1, Old.2 and Old.3.
  • 27. Tools, optimization, etc.
      • Ansible scripts for host validation and deployment
          • Thoroughness
          • Standardize the process
          • Deployment automation
      • Repartition optimization
          • Start migration of topics with low traffic first
          • 1 to 1 mapping between new and old brokers
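The "low traffic first" ordering can be sketched as a simple sort. The function name, the batching step, the topic names, and the traffic figures are all hypothetical; the slide only states that low-traffic topics are migrated first, and grouping them into fixed-size batches is an assumption made here for illustration.

```python
from typing import Dict, List


def migration_batches(bytes_in_per_sec: Dict[str, float], batch_size: int) -> List[List[str]]:
    """Order topics from lowest to highest traffic and cut the order into batches.

    Ties are broken by topic name so the plan is deterministic.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(bytes_in_per_sec, key=lambda t: (bytes_in_per_sec[t], t))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Made-up per-topic inbound traffic in bytes per second.
traffic = {
    "audit.log": 2_000.0,
    "payments.events": 900_000.0,
    "user.activity": 150_000.0,
    "risk.signals": 40_000.0,
    "metrics.raw": 5_000_000.0,
}

batches = migration_batches(traffic, batch_size=2)
print(batches)
```

Moving the quiet topics first exercises the procedure and tooling while little data is at stake, before the busiest topics are reassigned.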
  • 28. Lessons Learnt
      • Bootstrapping
          • Critical for Kafka migration
          • Can become complicated in a multi-tenant ecosystem
          • An independent topic lookup service with flexibility of metadata control helps a lot
      • Metadata
          • Well-designed schema
          • Upfront and proactive tracking and collecting of client metadata
  • 29. Lessons Learnt
      • Metrics
          • Indispensable during migration
          • Most important metrics during migration:
      Categories        Sample Metrics
      Volume Tracking   MessagesIn/OutPerSec; BytesIn/OutPerSec
      Partitions        ISR Expands/Shrinks; AtMinIsrPartitions count/Single Replicas; UnderReplicatedPartitions/OfflinePartitions
      System            CPU, Memory, FD count, iowait; Page swapping; Heap/thread dumps
      Other             Active Controller Count; ZK Connection/Session Timeouts; Inter-broker, broker->zk network latencies
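One way such metrics can be used during a migration is as a gate between steps. The sketch below is an assumption, not something the slides describe: the function name, the dictionary keys (borrowed from the metric names above), and the thresholds are all hypothetical, and the metrics are passed in as a plain dictionary rather than read from a monitoring system.

```python
from typing import Dict, List


def migration_blockers(metrics: Dict[str, float]) -> List[str]:
    """Return the reasons not to start the next migration step (empty list means go)."""
    blockers = []
    required = ("UnderReplicatedPartitions", "OfflinePartitions", "ActiveControllerCount")
    for name in required:
        if name not in metrics:
            blockers.append(f"{name} is missing")
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        blockers.append("under-replicated partitions present")
    if metrics.get("OfflinePartitions", 0) > 0:
        blockers.append("offline partitions present")
    # Summed over the cluster, exactly one broker should be the active controller.
    if "ActiveControllerCount" in metrics and metrics["ActiveControllerCount"] != 1:
        blockers.append("active controller count is not 1")
    return blockers


healthy = {"UnderReplicatedPartitions": 0, "OfflinePartitions": 0, "ActiveControllerCount": 1}
degraded = {"UnderReplicatedPartitions": 12, "OfflinePartitions": 0, "ActiveControllerCount": 2}

print(migration_blockers(healthy))
print(migration_blockers(degraded))
```

Returning the list of reasons, rather than a bare boolean, makes it easy to surface why a step was held back.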
  • 30. Future: Future PayPal Kafka Platform
      • More reliable and scalable
      • More resilient
      • Become cloud native
      • Cloud and on-prem hybrid platform