For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address these, we have developed a new mirroring solution, built on top of our stream ingestion service, Brooklin. Brooklin’s mirroring solution aims to provide improved performance and stability, while facilitating better management via finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, and per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures, and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker.
In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin and our plans for iterating further on this new mirroring solution.
4. Use Cases
● Aggregating data from all data centers
● Moving data from offline data stores into online environments
● Moving data between LinkedIn and external cloud services
12. KMM does not scale well
● # of KMM clusters = (# of data centers)² × # of Kafka clusters
● More consumer-producer pairs → need to provision more hardware
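The quadratic term dominates quickly. A minimal sketch of the arithmetic (the data-center and cluster counts below are hypothetical, not LinkedIn's actual topology):

```java
// Illustrates the quadratic growth of KMM cluster count:
// one KMM cluster per (source DC, destination DC) pair, per mirrored Kafka cluster.
public class KmmClusterCount {
    static int kmmClusters(int dataCenters, int kafkaClusters) {
        return dataCenters * dataCenters * kafkaClusters;
    }

    public static void main(String[] args) {
        System.out.println(kmmClusters(5, 4)); // 5² × 4 = 100 KMM clusters
        System.out.println(kmmClusters(6, 4)); // adding one DC jumps it to 144
    }
}
```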
13. KMM is difficult to operate
● Static configuration file per KMM cluster
● Changes require deploying to 100+ clusters
17. Brooklin Mirroring
● Optimized for stability and operability
● Built on top of our streaming data pipelines service, Brooklin
● Brooklin Kafka mirroring has been in production for 1+ years
● Open-sourced Brooklin last month
25. Kafka mirroring built on Brooklin
[Diagram: Brooklin bridges Sources and Destinations — messaging systems (including Microsoft EventHubs) and databases appear on both sides]
36. On-demand Diagnostics
[Diagram: a Diagnostics REST API in front of the Brooklin engine, coordinated via ZooKeeper]
getAllStatus → GET /diag?datastream=mm_DC1-tracking_DC2-aggregate-tracking
host1.prod.linkedin.com:
datastream: mm_DC1-tracking_DC2-aggregate-tracking
assignedTopicPartitions: [topicA-0, topicA-3, topicB-0, topicB-2]
autoPausedPartitions: [{topicA-3: {reason: SEND_ERROR, description: failed to produce messages from this partition}}]
manuallyPausedPartitions: []
host2.prod.linkedin.com:
datastream: mm_DC1-tracking_DC2-aggregate-tracking
assignedTopicPartitions: [topicA-1, topicA-2, topicB-1, topicB-3]
autoPausedPartitions: []
manuallyPausedPartitions: []
37. Error Isolation
● Manually pause and resume mirroring at every level
○ Entire pipeline, topic, or topic-partition
● Brooklin can automatically pause mirroring of problematic partitions
○ Auto-resumes them after a configurable duration
● Messages from other partitions continue to flow
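The auto-pause/auto-resume bookkeeping can be sketched as follows. This is an illustration of the idea only; the class and method names are hypothetical, not Brooklin's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per-partition auto-pause with timed auto-resume.
// Only failing partitions are paused; all others keep flowing.
public class PartitionPauseTracker {
    private final Map<String, Long> autoPaused = new HashMap<>(); // partition -> pause time (ms)
    private final long resumeAfterMs; // configurable pause duration

    public PartitionPauseTracker(long resumeAfterMs) {
        this.resumeAfterMs = resumeAfterMs;
    }

    // Called when producing from a partition fails (e.g., SEND_ERROR).
    public void autoPause(String topicPartition, long nowMs) {
        autoPaused.put(topicPartition, nowMs);
    }

    public boolean isPaused(String topicPartition, long nowMs) {
        // Auto-resume any partition whose pause duration has elapsed.
        autoPaused.values().removeIf(pausedAt -> nowMs - pausedAt >= resumeAfterMs);
        return autoPaused.containsKey(topicPartition);
    }
}
```

The mirroring loop would simply skip sends for partitions where `isPaused` returns true, leaving the rest of the pipeline unaffected.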
38. Processing Loop
while (!shutdown) {
  records = consumer.poll(timeout);
  for (record : records) {
    producer.send(record);
  }
  if (timeToCommit) {
    producer.flush();  // wait for all in-flight sends to be acknowledged
    consumer.commit(); // only then is it safe to commit consumer offsets
  }
}
39. Producer flush can be expensive
while (!shutdown) {
  records = consumer.poll(timeout);
  for (record : records) {
    producer.send(record);
  }
  if (timeToCommit) {
    producer.flush();  // blocks until every outstanding send completes
    consumer.commit();
  }
}
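One way to avoid the blocking flush is to track, via the producer's ack callback, the largest contiguous source offset that has been acknowledged per partition, and commit only up to that point. The sketch below illustrates this "flushless" idea under those assumptions; it is not Brooklin's actual implementation, and all names are illustrative:

```java
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: commit without producer.flush() by tracking acked offsets.
public class AckedOffsetTracker {
    // per-partition: offsets sent but not yet acknowledged
    private final Map<String, TreeSet<Long>> inFlight = new ConcurrentHashMap<>();
    // per-partition: highest offset handed to the producer
    private final Map<String, Long> maxSent = new ConcurrentHashMap<>();
    // per-partition: next offset that is safe to commit
    private final Map<String, Long> safeToCommit = new ConcurrentHashMap<>();

    // Record a send before handing the record to the producer.
    public void onSend(String partition, long offset) {
        inFlight.computeIfAbsent(partition, p -> new TreeSet<>()).add(offset);
        maxSent.merge(partition, offset, Math::max);
    }

    // Invoked from the producer's ack callback.
    public void onAck(String partition, long offset) {
        TreeSet<Long> pending = inFlight.get(partition);
        pending.remove(offset);
        // Everything below the smallest still-in-flight offset is durable.
        long safe = pending.isEmpty() ? maxSent.get(partition) + 1 : pending.first();
        safeToCommit.merge(partition, safe, Math::max);
    }

    // Offset to pass to consumer.commit() for this partition, or -1 if none yet.
    public long committableOffset(String partition) {
        return safeToCommit.getOrDefault(partition, -1L);
    }
}
```

With this bookkeeping, the commit step in the loop becomes a cheap per-partition lookup instead of a flush that stalls the whole pipeline on the slowest in-flight send.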