This presentation focuses on how to integrate all these components into an enterprise environment and what you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Apache Kafka: Online Talk Series
Part 1: September 27 | Part 2: October 6 | Part 3: October 27 | Part 4: November 17 | Part 5: December 1 | Part 6: December 15
Talks in the series: Introduction to Streaming Data and Stream Processing with Apache Kafka; Deep Dive into Apache Kafka; Demystifying Stream Processing with Apache Kafka; Data Integration with Apache Kafka; A Practical Guide to Selecting a Stream Processing Technology; and this talk.
https://www.confluent.io/apache-kafka-talk-series/
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
Two Example Apps
• User activity tracking
  • Collect page view events while users browse our web and mobile storefronts
  • Persist the data to HDFS for subsequent use in a recommendation engine
• Inventory adjustments
  • Track sales, maintain inventory, and re-order on demand
Application Priorities
• User activity tracking
  • High throughput (100x the sales stream)
  • Availability is most important
  • Low retention required: 3 days
• Inventory adjustments
  • Relatively low throughput
  • Durability is most important
  • Long retention required: 6 months
Partition Count
- Partitions are the unit of consumer parallelism
- Over-partition your topics (especially keyed topics)
- It is easy to add consumers, but hard to add partitions to a keyed topic: repartitioning changes which partition each key maps to
- Kafka can support tens of thousands of partitions per cluster
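Kafka's default partitioner assigns a keyed record to a partition by hashing the key modulo the partition count, which is why adding partitions to a keyed topic silently remaps keys. A minimal Python sketch of the idea (Kafka itself uses murmur2; `zlib.crc32` here is only an illustrative stand-in, and the SKU keys are hypothetical):

```python
# Why adding partitions breaks keyed topics: the partitioner hashes the key
# and takes it modulo the partition count, so changing the count changes
# which partition (and therefore which consumer) sees a given key.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition, like a hash partitioner does."""
    return zlib.crc32(key) % num_partitions

keys = [f"sku-{i}".encode() for i in range(100)]

before = {k: partition_for(k, 8) for k in keys}    # topic created with 8 partitions
after = {k: partition_for(k, 12) for k in keys}    # partitions later expanded to 12

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed partitions after expansion")
```

Per-key ordering guarantees only hold within a partition, so any key that moves loses its ordering history, which is why the slide recommends over-partitioning keyed topics up front.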
Partition Count
- High Throughput (User activity tracking)
- Large number of partitions (~100)
- Fewer Resources (Inventory adjustments)
- Smaller number of partitions (< 50)
Replication Factor
- More replicas require more storage, disk I/O, and network bandwidth
- More replicas can tolerate more failures

[Diagram: topic1-part1, topic1-part2, topic2-part1, and topic2-part2 each replicated three times, spread across brokers 1-4]
Retention
- Retention time can be set per topic
- Longer retention times require more storage (imagine that!)
- Longer retention allows consumers to rewind further back in time
- Part of the consumer’s SLA!
Retention
- Less Storage (User activity tracking)
- log.retention.hours=72 (3 days)
- Longer Time Travel (Inventory adjustments)
- log.retention.hours=4380 (6 months)
- Default is 7 days
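As a sketch of how the per-topic settings above might be applied with the stock 0.10-era tooling (the topic names and ZooKeeper address are placeholders; per-topic retention is set in milliseconds via the `retention.ms` topic config):

```shell
# User activity tracking: 3 days = 259,200,000 ms
bin/kafka-topics.sh --zookeeper zk:2181 --alter \
  --topic page-views --config retention.ms=259200000

# Inventory adjustments: ~6 months (4380 hours) = 15,768,000,000 ms
bin/kafka-topics.sh --zookeeper zk:2181 --alter \
  --topic inventory-adjustments --config retention.ms=15768000000
```

The topic-level override always wins over the broker-wide `log.retention.hours` default.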
Producer Acknowledgements on Send

[Diagram: (1) the producer sends to the partition leader on broker 1; (2) the leader replicates to followers on brokers 2 and 3; (3) the write is committed; (4) the ack returns to the producer; consumers read committed messages]

When producer receives ack     | Latency                | Durability on failures
acks=0 (no ack)                | no network delay       | some data loss
acks=1 (wait for leader)       | 1 network round trip   | a little data loss
acks=all (wait for committed)  | 2 network round trips  | no data loss
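As a sketch, the acks tradeoff maps onto producer configuration like this for the two example apps (these are real Kafka producer property names; the values simply reflect the priorities above):

```properties
# User activity tracking: availability and throughput first
acks=1

# Inventory adjustments: durability first
acks=all
retries=2147483647
# keep retries from reordering records
max.in.flight.requests.per.connection=1
```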
Replica Placement
• Partitions are replicated
• Replicas are spread evenly across the cluster
• Placement happens only when a topic is created or modified

[Diagram: replicas of topic1 and topic2 partitions spread evenly across brokers 1-4]
Replica Placement
• Over time, broker load and storage become unbalanced
• Initial replica placement does not account for topic throughput or retention
• Adding or removing brokers also skews the distribution

[Diagram: a newly added broker 5 holds no replicas while brokers 1-4 carry all the partitions]
Replica Reassignment
• Create a plan to rebalance replicas
• Upload the new assignment to the cluster
• Kafka migrates replicas without disruption

[Diagram: before reassignment, all replicas sit on brokers 1-4 and broker 5 is empty; after, replicas are spread across all five brokers]
Data Balancing: Tricky Parts
• Creating a good plan
  • Balance broker disk space
  • Balance broker load
  • Minimize data movement
  • Preserve rack placement
• Movement of replicas can overload I/O and bandwidth resources
  • Use the replication quota feature in 0.10.1
Data Balancing: Solutions
• DIY
• kafka-reassign-partitions.sh script in Apache Kafka
• Confluent Enterprise Auto Data Balancing
• Optimizes storage utilization
• Rack awareness and minimal data movement
• Leverages replication quotas during rebalance
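The DIY flow with `kafka-reassign-partitions.sh` looks roughly like this (a sketch; the ZooKeeper address, file names, and broker IDs are placeholders, and `--throttle` is the 0.10.1 replication-quota feature mentioned above):

```shell
# 1. Generate a candidate plan that moves the listed topics onto brokers 1-5
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4,5" --generate

# 2. Execute the plan (save the generated JSON as expand-cluster.json first),
#    throttling replica traffic so the move does not saturate the network
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file expand-cluster.json \
  --execute --throttle 50000000

# 3. Verify completion; this also removes the throttle once the move is done
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file expand-cluster.json --verify
```

Building a plan that balances disk, load, and rack placement is the part the stock tool leaves to you.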
Use Cases
• Disaster Recovery
• Replicate data out to geo-localized data centers
• Aggregate data from other data centers for analysis
• Part of hybrid cloud or cloud migration strategy
Stretched Cluster
• Low-latency links between 3 DCs. Typically AZs in a single AWS region.
• Applications in all 3 DCs share the same cluster and handle failures automatically.
• Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3)
• Use rack awareness in Kafka 0.10; manual partition placement otherwise
[Diagram: a single Kafka cluster stretched across AZ 1, AZ 2, and AZ 3 within one AWS region, with producers and consumers in each AZ]
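Rack awareness in Kafka 0.10 is a one-line broker setting; for a stretched cluster you treat each AZ as a "rack" so replicas are spread across AZs (the AZ name below is a placeholder):

```properties
# server.properties on each broker in AZ us-east-1a
broker.rack=us-east-1a
```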
Mirroring Across Clusters
• Separate Kafka clusters in each DC. Mirroring process copies data between them.
• Several variations of this pattern. Some require manual intervention on failover and recovery.
How to Mirror Across Clusters
• MirrorMaker tool in Apache Kafka
• Manual topic creation
• Manual sync of topic configuration
• Confluent Enterprise Multi-DC
• Dynamic topic creation at the destination
• Automatic sync for topic configurations (including access controls)
• Can be configured and managed from the Control Center UI
• Leverages Connect API
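A minimal MirrorMaker invocation looks roughly like this (a sketch; the file names are placeholders, the consumer config points at the source cluster, and the producer config points at the destination):

```shell
# Mirror every topic matching the whitelist regex between clusters
bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster-consumer.properties \
  --producer.config destination-cluster-producer.properties \
  --whitelist ".*"
```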
More Information: Tuning Tradeoffs
• Apache Kafka and Confluent documentation
• "When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka" – Gwen Shapira and Jeff Holoman
  https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/
• Kafka: The Definitive Guide, Chapter 6: Reliability Guarantees – Neha Narkhede, Gwen Shapira, Todd Palino
• Confluent Operations Training
More Information: Multi-DC
• "Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka" – Jun Rao
  • Video: https://www.youtube.com/watch?v=XcvHmqmh16g
  • Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka
• Confluent Enterprise Multi-DC: https://www.confluent.io/product/multi-datacenter/
More Information: Metadata Management
• "Yes, Virginia, You Really Do Need a Schema Registry" – Gwen Shapira
  https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
I’m an engineer at Confluent. In a previous job, I took Kafka from proof of concept all the way to production, with some pipelines handling more than 5B events per day. My goal is to share what I think are the most important things to know when taking Kafka to production.
This is the last talk in a series of 6. The previous talks cover components of the Kafka ecosystem and stream processing in general. This talk is about taking Kafka to production. In general, I think Kafka is pretty easy to operate and has great documentation compared to other technologies I’ve worked with. Since we cannot cover everything, I want to focus on the most important concepts and hopefully give you enough insight that you know what to plan for and where to find more information.
Patterns for integrating with existing data systems and applications – covered in a previous talk
Metadata management at enterprise scale – I’ll include a link at the end to a great blog post by Gwen Shapira
- Review the basics
- Tuning Kafka: what tradeoffs you can make
- Data balancing
- Spanning multiple datacenters
First, I want to review a few basics of Kafka to make sure we have enough context for the rest of the talk. This will be a quick review if you’ve seen other talks in the series. If not, that’s OK. You should be able to follow along.
Kafka is a streaming platform. A streaming platform can be THE common point of data integration across an organization. It allows the teams and systems within the organization to share data in real time and react as fast as necessary. It allows teams to work together without tight coupling of their services.
Kafka has some key characteristics that make it well-suited to being a streaming platform:
First it scales well and cheaply. Very efficient. You can do hundreds of MB/sec of writes per server and can have many servers.
Kafka doesn’t get slower as you store more data in it --- this is a huge win if you’ve operated other data systems
Distributed by design – replication, fault tolerance, partitioning, elastic scaling
Strong guarantees around ordering and durability
Has some unique features, such as compacted topics, that let it handle additional use cases.
Has enterprise features like fine-grained security controls.
Brokers hold data
Topics are logical streams that are broken into partitions
Partitions are the unit of parallelism for consumers
Topic-partitions are replicated and spread over multiple brokers
Producers and consumers read from brokers
Kafka relies on ZooKeeper for its own internal cluster management.
Deployment is pretty easy. There are only a few components, and they run as JVM processes.
No downtime
In distributed systems, there are tradeoffs. The goal of this section is to highlight the tradeoffs you can make to tune Kafka to match your application's priorities and get the most out of it.
You can imagine that we have a demand model to match supply with demand while keeping our inventory as low as possible.
What knobs do we have in Kafka to match these priorities?
We’re going to look at each of these and how to apply them to our example applications
Resources vs. Throughput
f replicas can tolerate f-1 failures.
Topic1 has 3 replicas in this example, spread over different brokers.
Cost vs. Availability
f replicas can tolerate f-1 failures.
Storage vs. Time Travel
Storage vs. Time Travel
Latency vs. Throughput
My experiments showed a 4x compression ratio with LZ4, even with Avro data.
Latency vs. Throughput
Compression now works on compacted topics too.
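Batching and compression are producer-side settings; a throughput-leaning sketch (real Kafka producer property names, illustrative values):

```properties
compression.type=lz4
# wait up to 20 ms to fill larger batches; larger batches also compress better
linger.ms=20
batch.size=65536
```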
Latency vs. Durability
Latency vs. Durability
Before we get to the next knob, we need to review the idea of in-sync replicas (ISR).
Availability vs. Durability
Availability vs. Durability
Availability vs. Durability
This should be very rare, but in a severe outage situation some applications prefer automatic recovery even if data is lost.
Availability vs. Durability
This is a good place to start and adjust down if you find you need to optimize further
With batching and compression, you should be able to get very good throughput and safety
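The availability-vs-durability knobs discussed here are broker- or topic-level settings. A durability-leaning sketch for the inventory topic (real property names; note that the unclean-leader default was true in 0.10):

```properties
# With acks=all, require at least 2 in-sync replicas for a write to succeed
min.insync.replicas=2
# Never elect an out-of-sync replica as leader, even at the cost of availability
unclean.leader.election.enable=false
```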
Now let’s assume that you’ve got a cluster up and running. You’ve set up good component-level monitoring so you can tell that ZooKeeper and Kafka are healthy. You’ve tuned it for your application's priorities. Kafka handles failures very smoothly and requires little attention. I’ve run it in production at a previous job and had a broker die without any interruption to the application (handling > 4B events/day).
However, there is some maintenance that you have to do around data balancing, so I think it’s important to understand it and plan for it.
In the example, broker 2 is under-utilized and broker 5 is not being utilized at all.
We’ve heard from many customers that this is a pain point; it took us 2 weeks.
We’ve talked about running Kafka reliably in a single data center. Another important consideration for putting Kafka in production is how to handle multiple datacenters. The topic is too deep to cover in detail, so the goal of this section is to give an introduction and motivate you to watch an excellent talk on this subject by Confluent co-founder Jun Rao.
This is the simplest setup for failure handling, but it does not work across regions.
There are a number of variations of this, and some of them require manual intervention on failure recovery. Details are in Jun’s talk. Please watch it. Over time, you’ll probably need to support multiple replication patterns to match different use cases.
This picture shows an example of 1) aggregating data from other DCs for analytics and 2) cross-replicating between DCs so they can both see each other’s data.
The goal of this section is to highlight the tradeoffs you can make to align Kafka with your application's priorities