SlideShare a Scribd company logo
1 of 36
Download to read offline
URP? Excuse You!
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn
• What is Kafka
• Encyclopedia of Monitoring
• Automation
What This
Talk Is Not
Why Talk About
Monitoring?
Messages per Day at LinkedIn
What is Monitoring (not)?
Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
Service
Level
Objectives
Key Kafka Metrics
The Three Metrics You Need to Know
Partitions that are not
fully replicated within
the cluster
URP
The overall utilization
of an Apache Kafka
broker
Request Handlers
How long requests are
taking, in which stage
of processing
Request Timing
Under-Replicated Partitions
• Highly discussed
• Overall cluster health
• Replication is a consumer and producer
Under-Replicated Partitions
EXAMPLE: FAILED BROKER
Under-Replicated Partitions
EXAMPLE: CONSUMER PROBLEMS
Under-Replicated Partitions
EXAMPLE: PRODUCER PROBLEMS
Under-Replicated Partitions
• Overrated
• Doesn’t map to SLO
• Often not actionable
• Collect, but don’t alert
Everybody
In The
Pool
• Specialized thread pools
• Clients deal with network and
request pools
• Request handlers do most of the
work
Request
Handlers
• Decode and validate
• Perform task
• Wait for other brokers
• Assemble response
Request Handler Problems
• Anything that causes Kafka
to expend CPU cycles
• Includes problems related to
failing disks (IO wait)
• SSL and compression work
both can use a lot of CPU
CPU Time Timeout Deadlock
• Most often due to failing to
process controller requests
• Intra-cluster requests tend to
be bound by partition counts
• Rapidly starves the pool of
threads
• Should always be a code bug
• Usually looks exactly like a
timeout problem
• Rare, but hard to identify
Request Handler Problems
EXAMPLE: TIMEOUT OR DEADLOCK
Request Handler Problems
• Anything that causes Kafka
to expend CPU cycles
• Includes problems related to
failing disks (IO wait)
• SSL and compression work
both can use a lot of CPU
CPU Time Timeout Deadlock
• Most often due to failing to
process controller requests
• Intra-cluster requests tend to
be bound by partition counts
• Rapidly starves the pool of
threads
• Should always be a code bug
• Usually looks exactly like a
timeout problem
• Rare, but hard to identify
Brokers Don’t Do Compression
Brokers Don’t Shouldn’t Do Compression
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Up Conversion Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to
clients
• Producer clients update to new version
Request Timing
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send - Send to client
• Total – Request handling, end to end
• Request Queue – Waiting to process
• Local – Work local to the broker
Request Timing
EXAMPLE: PRODUCE TOTAL TIME
Request Timing
EXAMPLE: PRODUCE LOCAL TIME
Request Timing
EXAMPLE: PRODUCE REMOTE TIME
Thank you?
What’s Missing?
Availability
Monitoring
• SLO, part 2
• Measured externally
• Client focused
• github.com/linkedin/kafka-monitor
Operating System
And Hardware
Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response
Capacity
Planning
• Plan in advance
• Multi-factor
• Don’t alert for capacity
Capacity
Metrics
• Request Handler Idle Ratio
• Disk Utilization
• Partition Count
• Network Utilization
Wrapping Up
If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
Getting (and Giving) Help
• Kafka Monitor
• https://github.com/linkedin/kafka-monitor
• Burrow
• https://github.com/linkedin/Burrow
• Cruise Control
• https://github.com/linkedin/cruise-control
• kafka-tools
• https://github.com/linkedin/kafka-tools
LinkedIn Open Source Get Involved
• Community
• users@kafka.apache.org
• dev@kafka.apache.org
• Bugs and Work:
• https://issues.apache.org/jira/projects/KAFKA
Thank you

More Related Content

What's hot

Real-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka EcosystemReal-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka Ecosystem
confluent
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
confluent
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
confluent
 

What's hot (20)

Real-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka EcosystemReal-Time Dynamic Data Export Using the Kafka Ecosystem
Real-Time Dynamic Data Export Using the Kafka Ecosystem
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
The New Way of Configuring Grace Periods for Windowed Operations in Kafka Str...
The New Way of Configuring Grace Periods for Windowed Operations in Kafka Str...The New Way of Configuring Grace Periods for Windowed Operations in Kafka Str...
The New Way of Configuring Grace Periods for Windowed Operations in Kafka Str...
 
How did we move the mountain? - Migrating 1 trillion+ messages per day across...
How did we move the mountain? - Migrating 1 trillion+ messages per day across...How did we move the mountain? - Migrating 1 trillion+ messages per day across...
How did we move the mountain? - Migrating 1 trillion+ messages per day across...
 
The Road Most Traveled: A Kafka Story | Heikki Nousiainen, Aiven
The Road Most Traveled: A Kafka Story | Heikki Nousiainen, AivenThe Road Most Traveled: A Kafka Story | Heikki Nousiainen, Aiven
The Road Most Traveled: A Kafka Story | Heikki Nousiainen, Aiven
 
Event Driven Architectures with Apache Kafka on Heroku
Event Driven Architectures with Apache Kafka on HerokuEvent Driven Architectures with Apache Kafka on Heroku
Event Driven Architectures with Apache Kafka on Heroku
 
Monitoring Apache Kafka
Monitoring Apache KafkaMonitoring Apache Kafka
Monitoring Apache Kafka
 
Building Event Streaming Microservices with Spring Boot and Apache Kafka | Ja...
Building Event Streaming Microservices with Spring Boot and Apache Kafka | Ja...Building Event Streaming Microservices with Spring Boot and Apache Kafka | Ja...
Building Event Streaming Microservices with Spring Boot and Apache Kafka | Ja...
 
Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic...
Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic...Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic...
Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic...
 
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...
 
Integrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your EnvironmentIntegrating Apache Kafka Into Your Environment
Integrating Apache Kafka Into Your Environment
 
Building a Web Application with Kafka as your Database
Building a Web Application with Kafka as your DatabaseBuilding a Web Application with Kafka as your Database
Building a Web Application with Kafka as your Database
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
 
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
 
Apache Kafka® Delivers a Single Source of Truth for The New York Times
Apache Kafka® Delivers a Single Source of Truth for The New York TimesApache Kafka® Delivers a Single Source of Truth for The New York Times
Apache Kafka® Delivers a Single Source of Truth for The New York Times
 
How to over-engineer things and have fun? | Oto Brglez, OPALAB
How to over-engineer things and have fun? | Oto Brglez, OPALABHow to over-engineer things and have fun? | Oto Brglez, OPALAB
How to over-engineer things and have fun? | Oto Brglez, OPALAB
 
APAC Kafka Summit - Best Of
APAC Kafka Summit - Best Of APAC Kafka Summit - Best Of
APAC Kafka Summit - Best Of
 
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
 

Similar to URP? Excuse You! The Three Metrics You Have to Know

Design Reviews for Operations - Velocity Europe 2014
Design Reviews for Operations - Velocity Europe 2014Design Reviews for Operations - Velocity Europe 2014
Design Reviews for Operations - Velocity Europe 2014
Mandi Walls
 

Similar to URP? Excuse You! The Three Metrics You Have to Know (20)

Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Resilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes BackResilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes Back
 
Make It Cooler: Using Decentralized Version Control
Make It Cooler: Using Decentralized Version ControlMake It Cooler: Using Decentralized Version Control
Make It Cooler: Using Decentralized Version Control
 
Asynchronous programming using CompletableFutures in Java
Asynchronous programming using CompletableFutures in JavaAsynchronous programming using CompletableFutures in Java
Asynchronous programming using CompletableFutures in Java
 
Fault Tolerance in Distributed Environment
Fault Tolerance in Distributed EnvironmentFault Tolerance in Distributed Environment
Fault Tolerance in Distributed Environment
 
Production Ready Microservices at Scale
Production Ready Microservices at ScaleProduction Ready Microservices at Scale
Production Ready Microservices at Scale
 
Benchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and ResultsBenchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and Results
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Client Drivers and Cassandra, the Right Way
Client Drivers and Cassandra, the Right WayClient Drivers and Cassandra, the Right Way
Client Drivers and Cassandra, the Right Way
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservices
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP Applications
 
Design Review Best Practices - SREcon 2014
Design Review Best Practices - SREcon 2014Design Review Best Practices - SREcon 2014
Design Review Best Practices - SREcon 2014
 
Continuous Delivery for the Rest of Us
Continuous Delivery for the Rest of UsContinuous Delivery for the Rest of Us
Continuous Delivery for the Rest of Us
 
CoAP Talk
CoAP TalkCoAP Talk
CoAP Talk
 
Design Reviews for Operations - Velocity Europe 2014
Design Reviews for Operations - Velocity Europe 2014Design Reviews for Operations - Velocity Europe 2014
Design Reviews for Operations - Velocity Europe 2014
 
Stream Processing @ Lyft
Stream Processing @ LyftStream Processing @ Lyft
Stream Processing @ Lyft
 
Project Sherpa: How RightScale Went All in on Docker
Project Sherpa: How RightScale Went All in on DockerProject Sherpa: How RightScale Went All in on Docker
Project Sherpa: How RightScale Went All in on Docker
 

More from confluent

More from confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

URP? Excuse You! The Three Metrics You Have to Know