Kafka gives you a collection of tools and components and tells you to “figure things out”. The two broad approaches are sync and async; I’ll show one of each. Either approach can be tailored to your specific situation – external constraints, specific requirements, additional systems involved. The goal of this presentation is to give you ideas and inspiration as you build your own. We are happy to help.
The easiest protection from one DC getting nuked:
Get a third data center (more below on what counts as “nearby”)
We had a consumer that was supposed to notify the warehouses about large orders. We want it to continue doing its job… just in a different DC now.
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
One Data Center is Not Enough
Scale and Availability of Apache Kafka in Multiple Data Centers
• Kafka cluster failure
• Major storage / network outage
• Entire DC is demolished
• Floods and Earthquakes
Disaster Recovery Plan:
“When in trouble
or in doubt
run in circles,
scream and shout”
Disaster Recovery Plan:

When This Happens → Do That
• Kafka cluster failure → Failover to a second cluster in the same data center
• Major storage / network outage → Failover to a second cluster in another “zone” in the same building
• Entire data center is demolished → Single Kafka cluster running in multiple nearby data centers / buildings
• Floods and earthquakes → Failover to a second cluster in another region
There is no such thing
as a free lunch
Anyone who tells you differently
is selling something
The same event will not
appear in two DCs at the
exact same time.
Things to ask:
• What are the guarantees in the event of an unplanned failover?
• What are the guarantees in the event of a planned failover?
• What is the process for failing back?
• How many data-centers are required?
• How does the solution impact my production performance?
• What are the bandwidth requirements between the data-centers?
Every solution needs to balance
these trade-offs
Kafka takes a DIY approach
The easy way
• Take 3 nearby data centers.
• Single-digit ms latency between them counts as “nearby”
• Install at least one ZooKeeper node in each
• Install at least one Kafka broker in each
• Configure each DC as a “rack” (broker.rack)
• Configure acks=all, min.insync.replicas=2
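The rack and durability settings above map onto broker, topic, and producer configs roughly like this (a sketch; the rack names are placeholders for your three DCs):

```properties
# server.properties on each broker: tag the broker with its DC so
# Kafka spreads replicas across "racks" (here, data centers)
broker.rack=dc-east          # dc-central / dc-west on the other brokers

# broker defaults: 3 replicas, and require 2 in-sync replicas
# before a write is acknowledged
default.replication.factor=3
min.insync.replicas=2

# producer side: wait for all in-sync replicas to ack each write
acks=all
```

With this layout, losing any one DC leaves two in-sync replicas, so acknowledged writes survive and failover really is “business as usual”.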
Pros:
• Easy to set up
• Failover is “business as usual”
• Sync replication – the only method to guarantee no loss of data

Cons:
• Need 3 data centers nearby
• Cluster failure is still a disaster
• Higher latency, lower throughput compared to a “normal” cluster
• Traffic between DCs can be a bottleneck
• Costly infrastructure
Want sync replication but only
two data centers?

Solutions I can’t recommend:

Solution → I hesitate because…
• 2 ZK nodes in each DC plus an “observer” → Did anyone do this before?
• 3 ZK nodes in each DC, manually reconfigure the quorum for failover → You may lose ZK updates during failover; requires manual intervention
• 2 separate ZK clusters + replication
Most companies don’t do stretch clusters:
• Only 2 data centers
• Data centers are far apart
• One cluster isn’t safe enough
• Not into “high latency”
So you want to run
2 Kafka clusters
And replicate events
• Active-Active is efficient: you use both DCs
• Active-Active is easier
because both clusters are
• Active-Passive has lower
• Active-Passive requires
Unfortunately, this is not that simple
1. There is no guarantee that offsets are identical in the two data centers.
Event with offset 26 in NYC can be offset 6 or offset 30 in ATL.
2. Replication of each topic and partition is independent. So:
1. Offset metadata may arrive ahead of events themselves
2. Offset metadata may arrive late
Nothing prevents you from replicating offsets topic and using it. Just be realistic
about the guarantees.
If accuracy is no big deal…
1. If duplicates are cool – start from the beginning.
• Writing to a DB
• Anything idempotent
• Sending emails or alerts to people inside the company
2. If lost events are cool – jump to the latest event.
• Clickstream analytics
• Log analytics
• “Big data” and analytics use-cases
Personal Favorite – Time-based Failover
• Offsets are not identical, but…
3pm is 3pm (within clock drift)
• Relies on new features:
• Timestamps in events! 0.10.0.0
• Time-based indexes! 0.10.1.0
• Tool to force consumer offsets to timestamps! 0.11.0.0
How do we do it?
1. Detect Kafka in NYC is down. Check the time of the incident.
• Even better:
Use an interceptor to track timestamps of events as they are
consumed. Now you know “last consumed time-stamp”
2. Run Consumer Groups tool in ATL and set the offsets for “following-orders”
consumer to time of incident (or “last consumed time”)
3. Start the “following-orders” consumer in ATL
4. Have a beer. You just aced your annual failover drill.
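Step 2 uses the offset-reset feature added to the consumer groups tool in 0.11.0.0. A sketch of the invocation (the bootstrap server, group, and topic names here are from this example and would differ in your deployment):

```shell
# In the ATL cluster: rewind the "following-orders" group to the
# time of the incident (or to the last consumed timestamp)
kafka-consumer-groups --bootstrap-server atl-kafka:9092 \
  --group following-orders \
  --topic orders \
  --reset-offsets --to-datetime 2017-08-28T15:00:00.000 \
  --execute

# Then start the consumer in ATL as usual
```

Run it with --dry-run instead of --execute first to preview the new offsets before committing them. Note the group must not have active consumers while you reset its offsets.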
• Above all – practice
• Constantly monitor replication lag. High enough lag and everything is useless.
• Also monitor replicator for liveness, errors, etc.
• Chances are the line to the remote DC is both high latency and low throughput.
Prepare to do some work to tune the producers/consumers of the replicator.
• RTFM: http://docs.confluent.io/3.3.0/multi-dc/replicator-tuning.html
• Replicator plays nice with containers and auto-scale. Give it a try.
• Call your legal dept. You may be required to encrypt everything you replicate.
• Watch different versions of this talk. We discuss more architectures and more ops concerns.
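On the tuning point: over a high-latency, low-throughput link, the usual knobs on the replicator’s embedded producer are batching and compression. A sketch, with illustrative values rather than recommendations:

```properties
# Producer settings for replicating over a slow WAN link (illustrative)
compression.type=lz4        # fewer bytes on the wire
linger.ms=100               # wait longer to build larger batches
batch.size=1000000          # ~1 MB batches amortize the round trip
max.request.size=2000000    # allow larger requests than the default
```

Larger batches and compression trade a little end-to-end latency for much better throughput per round trip, which is usually the right trade on a cross-DC link. Measure replication lag before and after any change.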