(Todd Palino, LinkedIn) Kafka Summit SF 2018
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert in identifying problems with the least amount of pain:
-Under-replicated Partitions: The mother of all metrics
-Request Latencies: Why your users complain
-Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
6. Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
7. Service Level Objectives
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
9. The Three Metrics You Need to Know
• URP – Partitions that are not fully replicated within the cluster
• Request Handlers – The overall utilization of an Apache Kafka broker
• Request Timing – How long requests are taking, and in which stage of processing
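To make the URP metric concrete, here is a minimal Python sketch that counts under-replicated partitions from `kafka-topics.sh --describe` output. The sample lines are fabricated for illustration; in production you would watch the broker’s `UnderReplicatedPartitions` JMX gauge (`kafka.server:type=ReplicaManager`) instead of parsing CLI output.

```python
# Sketch: count under-replicated partitions (URP) from `kafka-topics.sh
# --describe` output. The sample text below is illustrative, not from a
# real cluster; the tab-separated line format matches the Kafka CLI.
sample = """\
Topic: clicks\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3
Topic: clicks\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,3
Topic: views\tPartition: 0\tLeader: 3\tReplicas: 3,1,2\tIsr: 3
"""

def count_urp(describe_output: str) -> int:
    """A partition is under-replicated when its ISR is smaller than
    its replica set."""
    urp = 0
    for line in describe_output.splitlines():
        fields = dict(
            part.split(": ", 1) for part in line.split("\t") if ": " in part
        )
        if "Replicas" in fields and "Isr" in fields:
            replicas = fields["Replicas"].split(",")
            isr = fields["Isr"].split(",")
            if len(isr) < len(replicas):  # a replica fell out of the ISR
                urp += 1
    return urp

print(count_urp(sample))  # 2
```

Note that `kafka-topics.sh` also has an `--under-replicated-partitions` flag that filters this output for you; the sketch just shows what that check means.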
17. Request Handler Problems
CPU Time
• Anything that causes Kafka to expend CPU cycles
• Includes problems related to failing disks (IO wait)
• SSL and compression can both use a lot of CPU
Timeout
• Most often due to failing to process controller requests
• Intra-cluster requests tend to be bound by partition counts
• Rapidly starves the pool of threads
Deadlock
• Should always be a code bug
• Usually looks exactly like a timeout problem
• Rare, but hard to identify
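The “how could 80% be a problem?” question from the metric list refers to the broker’s `RequestHandlerAvgIdlePercent` gauge (`kafka.server:type=KafkaRequestHandlerPool`), which reads 1.0 when the pool is fully idle and 0.0 when saturated. A minimal Python sketch of turning that gauge into a health signal; the thresholds are illustrative choices, not official defaults:

```python
# Sketch: classify request handler pool health from the
# RequestHandlerAvgIdlePercent JMX gauge. Thresholds are assumptions
# for illustration: 80%+ busy already leaves very little headroom.
def handler_pool_health(avg_idle: float) -> str:
    """avg_idle is 1.0 when the pool is fully idle, 0.0 when saturated."""
    if avg_idle < 0.10:
        return "critical"  # pool nearly starved: requests queue up fast
    if avg_idle < 0.20:
        return "warning"   # 80%+ busy: investigate before it saturates
    return "ok"

print(handler_pool_health(0.15))  # warning
```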
21. Brokers Don’t (Shouldn’t) Do Compression
Down Conversion
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Up Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to clients
• Producer clients update to the new version
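Conversion load shows up in the broker’s `BrokerTopicMetrics` meters (`ProduceMessageConversionsPerSec` and `FetchMessageConversionsPerSec`, alongside `MessagesInPerSec`). A hedged Python sketch of estimating what share of produce traffic the broker is converting; the sample rates are made up:

```python
# Sketch: estimate the fraction of produced messages the broker is
# converting between message formats. Inputs correspond to the
# ProduceMessageConversionsPerSec and MessagesInPerSec JMX meters;
# the values below are illustrative, not real measurements.
def conversion_ratio(conversions_per_sec: float,
                     messages_in_per_sec: float) -> float:
    if messages_in_per_sec == 0:
        return 0.0  # no traffic, nothing to convert
    return conversions_per_sec / messages_in_per_sec

ratio = conversion_ratio(conversions_per_sec=450.0,
                         messages_in_per_sec=1000.0)
print(f"{ratio:.0%} of produced messages are being converted")  # 45%
```

A sustained non-zero ratio means the broker is burning CPU (and losing zero-copy sends, for down conversion) on work the clients could avoid by upgrading or by aligning the message format version.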
22. Request Timing
• Request Queue – Waiting to process
• Local – Work local to the broker
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send – Sending to the client
• Total – Request handling, end to end
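These stages sum to the total, which is what makes the breakdown useful: the dominant stage tells you where a slow request spent its time. A small sketch using the `kafka.network:type=RequestMetrics` stage names, with made-up values:

```python
# Sketch: the per-request timing metrics decompose end-to-end latency
# into stages that sum to TotalTimeMs. Stage values here are invented
# to show how the breakdown points at the slow stage.
stages = {
    "RequestQueueTimeMs": 2.0,   # waiting to be picked up by a handler
    "LocalTimeMs": 10.0,         # work on this broker (e.g. log append)
    "RemoteTimeMs": 25.0,        # waiting on other brokers (e.g. acks=all)
    "ResponseQueueTimeMs": 1.0,  # waiting for a network thread
    "ResponseSendTimeMs": 2.0,   # pushing the response to the client
}
total = sum(stages.values())
worst = max(stages, key=stages.get)
print(f"TotalTimeMs={total}, dominated by {worst}")
# TotalTimeMs=40.0, dominated by RemoteTimeMs
```

In this invented example a large `RemoteTimeMs` would point at replication (followers being slow to acknowledge), not at the broker handling the request.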
29. Operating System and Hardware Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response
34. If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
35. Getting (and Giving) Help
LinkedIn Open Source
• Kafka Monitor
• https://github.com/linkedin/kafka-monitor
• Burrow
• https://github.com/linkedin/Burrow
• Cruise Control
• https://github.com/linkedin/cruise-control
• kafka-tools
• https://github.com/linkedin/kafka-tools
Get Involved
• Community
• users@kafka.apache.org
• dev@kafka.apache.org
• Bugs and Work:
• https://issues.apache.org/jira/projects/KAFKA