One of the key metrics to monitor when working with Apache Kafka, whether as a data pipeline or a streaming platform, is consumer group lag.
Lag is the delta between the last produced offset and the last committed offset of a partition. In other words, lag indicates how far behind your application is in processing up-to-date information.
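To make the definition concrete, here is a minimal sketch of per-partition lag as plain integer arithmetic. The function name and the sample offsets are illustrative, not part of any Kafka API:

```python
# Minimal sketch of the lag definition: for each partition, lag is the
# difference between the log-end offset (last produced) and the consumer
# group's committed offset. Offsets here are plain integers, not real
# Kafka API objects.
def partition_lag(last_produced_offset: int, last_committed_offset: int) -> int:
    return last_produced_offset - last_committed_offset

# A consumer committed at offset 970 on a partition whose log ends at
# offset 1000 is 30 messages behind.
print(partition_lag(1000, 970))  # 30
```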
For a long time we used our own service to track, collect, and visualize these metrics, but this didn't scale well.
It required many manual operations, redeployments, and other tedious tasks. Most importantly, the biggest gap for us was that its output was expressed in absolute numbers (e.g. "your lag is 30K"), which tells you basically nothing as a human being.
We understood that we had to find a more suitable solution, one that would give us better visibility and allow us to measure lag in a time-based format that we all understand.
In this talk, I'm going to go over the core concepts of Kafka offsets and lag, and explain why lag matters and is an important KPI to measure. I'll also cover the research we did to find the right tool, what the options on the market were at the time, and why we eventually chose LinkedIn's Burrow. Finally, I'll take a closer look at Burrow: its building blocks, how we build and deploy it, how it improved our monitoring, and the most important improvement of all - how we transformed its output from raw numbers into time-based metrics.
10. __consumer_offsets
Offsets can be stored either in ZooKeeper or in a special topic called __consumer_offsets.
https://cwiki.apache.org/confluence/display/KAFKA/Offset+Management
11. ZooKeeper is not built for a high-write load such as offset storage.
12. __consumer_offsets: a consistent, fault-tolerant, and partitioned way of storing offsets.
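The __consumer_offsets topic is compacted: each commit is keyed by (group, topic, partition), and compaction keeps only the latest value per key. A toy model of that behavior, using a plain dict (illustrative only, not real Kafka internals):

```python
# Toy model of the compacted __consumer_offsets topic: each commit is a
# message keyed by (group, topic, partition), and log compaction keeps
# only the latest value per key -- behaving much like a dict.
commit_log = {}

def commit_offset(group: str, topic: str, partition: int, offset: int) -> None:
    # A newer commit for the same key replaces the older one.
    commit_log[(group, topic, partition)] = offset

commit_offset("billing", "orders", 0, 120)
commit_offset("billing", "orders", 0, 134)  # compaction keeps only this one

print(commit_log[("billing", "orders", 0)])  # 134
```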
34. What we wanted to achieve
> Automatic - no need to change a config file; filter consumer groups based on a regex
> Scalable - small footprint, easy to scale
> Simple, easy to use - supports both ZooKeeper and the __consumer_offsets topic
35. What we wanted to achieve
The "raw" metrics we looked for are:
> Per partition
> Per consumer group
> Per topic
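These three granularities roll up naturally: per-partition lag is the raw measurement, and per-topic and per-consumer-group figures are sums over it. A small sketch with hypothetical lag samples (the group/topic names and numbers are made up for illustration):

```python
# Hypothetical per-partition lag samples:
# (consumer_group, topic, partition) -> lag in messages.
lags = {
    ("billing", "orders", 0): 10,
    ("billing", "orders", 1): 25,
    ("billing", "payments", 0): 5,
}

def topic_lag(group: str, topic: str) -> int:
    """Roll per-partition lag up into a per-topic figure for one group."""
    return sum(v for (g, t, _), v in lags.items() if g == group and t == topic)

def group_lag(group: str) -> int:
    """Sum lag over every partition the group consumes, across topics."""
    return sum(v for (g, _, _), v in lags.items() if g == group)

print(topic_lag("billing", "orders"))  # 35
print(group_lag("billing"))            # 40
```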
37. What are the options?
LinkedIn - Burrow
> A LinkedIn project
> More than 2.5K stars
> Active community
> Production ready
Lightbend - Kafka Lag Exporter
> Smart
> Time-based
> Still in beta
Zalando - Remora
> Inspired by Burrow
> CloudWatch & DataDog integration
> Wraps the Kafka CLI
56. Time Lag - How did we do it?
Time_Lag = Diff(Last_Consumed, Last_Produced) / Producer_Rate
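The formula above can be sketched in a few lines: divide the offset gap by the producer's write rate to turn a message count into seconds. This assumes a roughly steady producer rate; the function and parameter names are illustrative:

```python
def time_lag_seconds(last_produced: int, last_consumed: int,
                     producer_rate_msgs_per_sec: float) -> float:
    """Estimate lag in seconds: the offset gap divided by how fast the
    producer writes. Assumes a roughly steady producer rate."""
    if producer_rate_msgs_per_sec <= 0:
        return 0.0  # idle producer: treat the partition as caught up
    return (last_produced - last_consumed) / producer_rate_msgs_per_sec

# 30,000 messages behind at 1,000 msgs/sec => about 30 seconds of lag --
# far more meaningful to a human than "your lag is 30K".
print(time_lag_seconds(30_000, 0, 1_000.0))  # 30.0
```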
57. Time Lag - How did we do it?
Timeline - producer offset samples:
> 12:00AM - Msg_offset: 134
> 12:10AM - Msg_offset: 144
> 12:20AM - Msg_offset: 154
[Diagram: the consumer's position and the producer's head plotted on this timeline; the distance between them is the lag.]
59. What's next?
> Smart alerts - dynamic alerts based on lag and retention
> Decoupling - as we grow, Burrow will be deployed per cluster
> Migration - migrating a crucial part of the infrastructure is hard