IBM Event Streams, Apache Kafka
© 2020 IBM Corporation
Help, My Kafka’s Broken
Emma Humber
Gantigmaa (Tina) Selenge
Kafka Summit 2020
Help, my Kafka’s broken: Prepare
Include resource names and routing between components in a topology diagram.
Collect logs and store JMX metrics published by Kafka brokers, clients, the JVM and the OS.
Make logs useful.
Change one thing at a time.
Help, my Kafka’s broken: Review
Use logs to create a timeline of events. Consult your metrics.
Compare with a working system.
Collect data for root cause analysis before restarting.
Understand what you need to find the root cause next time.
No, really, what changed?
[2020-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 with initial high watermark 0 (kafka.cluster.Replica)
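A log line in this layout can be parsed and merged with lines from other brokers into one timeline, as the Review step suggests. The following is a minimal sketch, not a definitive tool: the function names, the source labels (`broker-1`, `broker-2`) and the second sample line are made up for illustration, and the regular expression assumes the default `[timestamp] LEVEL message (logger)` layout shown above.

```python
import re
from datetime import datetime

# Assumes the layout "[timestamp] LEVEL message (logger)" shown in the slide.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"(?P<level>[A-Z]+) (?P<msg>.*?)(?: \((?P<logger>[\w.$]+)\))?$"
)

def parse_line(line, source):
    """Return (timestamp, source, level, message, logger), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # for example a stack-trace continuation line
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, source, m.group("level"), m.group("msg"), m.group("logger")

def build_timeline(logs_by_source):
    """Merge lines from several sources into one list ordered by timestamp."""
    events = []
    for source, lines in logs_by_source.items():
        for line in lines:
            parsed = parse_line(line, source)
            if parsed is not None:
                events.append(parsed)
    return sorted(events, key=lambda e: e[0])

# Hypothetical input: the first line is from the slide, the second is a placeholder.
SAMPLE = {
    "broker-1": [
        "[2020-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 "
        "with initial high watermark 0 (kafka.cluster.Replica)"
    ],
    "broker-2": ["[2020-09-12 13:44:01,100] WARN placeholder message (example.Logger)"],
}

for ts, source, level, msg, logger in build_timeline(SAMPLE):
    print(ts.isoformat(), source, level, msg)
```

Sorting on the parsed timestamp only gives a trustworthy order when the hosts' clocks are synchronised.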
Logs
Find the log4j.properties file
kafka-install/config/log4j.properties
Edit output location
log4j.appender.kafkaAppender.File=mylog123.log
Change log level
log4j.rootLogger=INFO, stdout, kafkaAppender
log4j.logger.kafka=DEBUG
log4j.logger.org.apache.kafka=TRACE
Hangs
Collect javacores (thread dumps) at intervals.
kill -3 $JAVA_PID
jstack -l $JAVA_PID > javacore.txt
Look for threads whose stacks don’t change between dumps, and for deadlock alerts.
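Comparing dumps taken at intervals can be automated. The sketch below assumes HotSpot-style `jstack` text (a quoted thread name on the header line, `at ...` frames, a blank line between threads); an IBM javacore has a different layout and would need its own parser. The sample dumps, thread names and class names are hypothetical.

```python
def parse_thread_dump(text):
    """Map thread name -> tuple of stack lines, for HotSpot-style jstack output."""
    threads = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('"') and '"' in line[1:]:
            # Header line: the thread name is the text between the first pair of quotes.
            name = line[1:line.index('"', 1)]
            threads[name] = []
        elif not line:
            name = None  # a blank line ends the current thread block
        elif name is not None and line.startswith(("at ", "- ")):
            threads[name].append(line)
    return {n: tuple(frames) for n, frames in threads.items()}

def unchanged_threads(dumps):
    """Names of threads that show the same non-empty stack in every dump."""
    parsed = [parse_thread_dump(d) for d in dumps]
    common = set(parsed[0]).intersection(*parsed[1:])
    return sorted(
        n for n in common
        if parsed[0][n] and all(p[n] == parsed[0][n] for p in parsed)
    )

def reports_deadlock(text):
    """Assumption: HotSpot announces deadlocks with the phrase 'Java-level deadlock'."""
    return "Java-level deadlock" in text

# Hypothetical dumps taken a few seconds apart.
DUMP_1 = '''
"worker-a" #11 prio=5
   java.lang.Thread.State: BLOCKED
    at example.Stuck.call(Stuck.java:10)

"worker-b" #12 prio=5
   java.lang.Thread.State: RUNNABLE
    at example.Busy.step1(Busy.java:20)
'''
DUMP_2 = DUMP_1.replace("step1(Busy.java:20)", "step2(Busy.java:31)")

print(unchanged_threads([DUMP_1, DUMP_2]))
```

An unchanged stack is only a candidate for a hang: idle pool threads parked on a queue also look identical from dump to dump.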
Javacore
Memory
Excessive load or a memory leak?
Use Health Center to analyse heap dumps.
-XX:+HeapDumpOnOutOfMemoryError
Memory
Kubernetes can terminate containers that exceed configured resources.
Leave room for native memory allocation and pagecache, as well as heap.
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
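The exit code in that output can be decoded mechanically: by convention an exit code above 128 means the process was killed by signal (code - 128), so 137 corresponds to signal 9, SIGKILL. A small sketch, assuming the text has been copied from `kubectl describe pod`; the function name and the result keys are made up for illustration.

```python
import re

# Only the two signals most relevant here; extend as needed.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}

def describe_termination(pod_description):
    """Extract reason, exit code and implied signal from 'Last State' text."""
    reason = re.search(r"Reason:\s*(\S+)", pod_description)
    code = re.search(r"Exit Code:\s*(\d+)", pod_description)
    result = {
        "reason": reason.group(1) if reason else None,
        "exit_code": int(code.group(1)) if code else None,
        "signal": None,
    }
    # Exit codes above 128 mean "killed by signal (code - 128)".
    if result["exit_code"] is not None and result["exit_code"] > 128:
        number = result["exit_code"] - 128
        result["signal"] = SIGNAL_NAMES.get(number, f"signal {number}")
    return result

LAST_STATE = """Last State: Terminated
Reason: OOMKilled
Exit Code: 137"""

print(describe_termination(LAST_STATE))
```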
Garbage collection
Kafka can be sensitive to garbage collection.
Unexplained delays in processing increase message latency and show up as gaps between log timestamps.
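Gaps between log timestamps can be found with a short script. This is a sketch under assumptions: the lines use the `[yyyy-MM-dd HH:mm:ss,SSS]` prefix of the sample log line earlier in the deck, and the 5 second threshold is an arbitrary starting point rather than a recommended value. The sample lines are placeholders.

```python
import re
from datetime import datetime, timedelta

TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")

def find_gaps(lines, threshold=timedelta(seconds=5)):
    """Yield (previous_ts, next_ts, gap) for consecutive timestamped lines further apart than threshold."""
    previous = None
    for line in lines:
        m = TS_RE.match(line)
        if m is None:
            continue  # skip lines without a timestamp, such as stack traces
        current = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None and current - previous > threshold:
            yield previous, current, current - previous
        previous = current

# Placeholder lines; only the timestamps matter here.
LINES = [
    "[2020-09-12 13:44:02,633] INFO first placeholder line",
    "[2020-09-12 13:44:03,000] INFO second placeholder line",
    "[2020-09-12 13:44:15,250] INFO third placeholder line",
]

for before, after, gap in find_gaps(LINES):
    print(f"gap of {gap.total_seconds()} s between {before} and {after}")
```

A gap only shows that nothing was logged; compare it with the GC log or JVM pause metrics before blaming garbage collection.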
Java Monitoring
ZooKeeper
Send four-letter words to the ZooKeeper cluster to query state
echo srvr | nc <zookeeper_ip> 2181
Navigate the ZooKeeper tree
bin/zkCli.sh
If replication is stuck, consider deleting the znode representing the controller to trigger re-election.
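The same query can be scripted so that every server in the ensemble is checked the same way. A minimal sketch using only the standard library: the function names are made up, the host and port are whatever your ensemble uses, and the sample reply is illustrative rather than captured from a real server. Newer ZooKeeper releases restrict which four-letter words are enabled (`4lw.commands.whitelist`), so a command other than `srvr` may be refused.

```python
import socket

def four_letter_word(host, port, word="srvr", timeout=5.0):
    """Send a four-letter word to one ZooKeeper server and return its text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_srvr(reply):
    """Turn 'Key: value' lines from a srvr reply into a dict."""
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

# Illustrative reply text, not real output.
SAMPLE_REPLY = "Zookeeper version: 3.5.8\nMode: follower\nNode count: 42\n"
print(parse_srvr(SAMPLE_REPLY)["Mode"])
```

Against a live server the two functions combine as `parse_srvr(four_letter_word("<zookeeper_ip>", 2181))`.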
Monitoring
[Diagram: monitoring architecture. Labels shown: Kafka cluster brokers, JVM, jmx_exporter server, application producer and consumer, Prometheus, Alert Manager, Grafana, PagerDuty, external monitoring.]
Monitoring: Metrics
Start with a few key metrics.
Alert on carefully selected, key data.
Refine metrics and alerts when you encounter a problem.
Watch for missing metrics!
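Missing metrics are easy to overlook because a threshold alert never fires on data that is not there. One way to catch them is a staleness check, sketched below. The function name, the series names and the 120 second limit are assumptions for illustration; a monitoring system such as Prometheus has its own mechanisms for this.

```python
import time

def stale_series(last_seen, now=None, max_age_seconds=120.0):
    """Return names of metric series whose newest sample is older than max_age_seconds.

    last_seen maps a series name to the Unix timestamp of its latest sample.
    """
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age_seconds)

# Hypothetical series names and timestamps.
LAST_SEEN = {
    "broker-0/BytesInPerSec": 1_000_000.0,
    "broker-1/BytesInPerSec": 999_700.0,  # five minutes older
}
print(stale_series(LAST_SEEN, now=1_000_030.0))
```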
Monitoring: System
Network traffic.
CPU and I/O latency.
Memory allocation and garbage collection.
Disk capacity and latency.
Broker metrics: Partitions
Under-replicated partition count.
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Partitions with fewer than the minimum number of in-sync replicas.
kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
Offline partitions have no leader.
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
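All three gauges should normally read 0, which makes them simple to alert on. The sketch below evaluates already-collected values; the function name, the severity labels and the input dict are assumptions for illustration, and only the three metric names come from the slide.

```python
def partition_alerts(metrics):
    """Return alert messages for the three partition-health gauges."""
    rules = [
        ("OfflinePartitionsCount", "critical", "partitions without a leader"),
        ("UnderMinIsrPartitionCount", "critical", "partitions below the minimum in-sync replica count"),
        ("UnderReplicatedPartitions", "warning", "under-replicated partitions"),
    ]
    alerts = []
    for name, severity, meaning in rules:
        value = metrics.get(name)
        if value is None:
            # A missing gauge is reported instead of being treated as healthy.
            alerts.append(f"warning: {name} is missing")
        elif value > 0:
            alerts.append(f"{severity}: {value} {meaning} ({name})")
    return alerts

# Hypothetical values scraped from one cluster.
print(partition_alerts({"OfflinePartitionsCount": 0,
                        "UnderMinIsrPartitionCount": 0,
                        "UnderReplicatedPartitions": 3}))
```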
Broker metrics: Throughput
Throughput shows the overall performance and health of your cluster.
kafka.server:type=BrokerTopicMetrics,name=
BytesInPerSec
BytesOutPerSec
ReplicationBytesInPerSec
ReplicationBytesOutPerSec
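These meters also expose a cumulative count, so a throughput figure can be derived from two samples if your tooling stores raw counts. A minimal sketch; the function name and the sample tuples are hypothetical, and most monitoring systems provide an equivalent rate function.

```python
def bytes_per_second(earlier, later):
    """Average rate between two (timestamp_seconds, cumulative_count) samples."""
    (t0, c0), (t1, c1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if c1 < c0:
        # The counter went backwards, which usually means the broker restarted.
        return None
    return (c1 - c0) / (t1 - t0)

# Hypothetical samples taken 10 seconds apart.
print(bytes_per_second((0, 1000), (10, 6000)))
```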
Client metrics: Producer
Monitor metrics showing the trend of flow rates and latency.
kafka.producer:type=producer-metrics,name=
record-send-rate
record-error-rate
request-rate
request-latency-avg
response-rate
io-wait-time-ns-avg
Client metrics: Consumer
Understand what lag is acceptable.
kafka.consumer:type=consumer-fetch-manager-metrics,name=
records-lag
records-lead-min
records-consumed-rate
kafka-consumer-groups.sh
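Lag for a partition is the log end offset minus the consumer group's committed offset, which is also how the output of `kafka-consumer-groups.sh` can be read. The sketch below works on offsets that have already been collected; the function names, the topic name and the threshold are hypothetical.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the group's committed offset."""
    lag = {}
    for tp, end in log_end_offsets.items():
        committed = committed_offsets.get(tp)
        # A partition with no committed offset is reported as None rather than guessed.
        lag[tp] = None if committed is None else max(end - committed, 0)
    return lag

def over_threshold(lag, acceptable):
    """Partitions whose lag exceeds the acceptable value chosen for this application."""
    return {tp: v for tp, v in lag.items() if v is not None and v > acceptable}

# Hypothetical offsets keyed by (topic, partition).
END = {("orders", 0): 1500, ("orders", 1): 900, ("orders", 2): 40}
COMMITTED = {("orders", 0): 1490, ("orders", 1): 100}

lag = partition_lag(END, COMMITTED)
print(over_threshold(lag, acceptable=100))
```

What counts as acceptable depends on the application, so the threshold is a parameter rather than a constant.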
Client metrics: Consumer
A consumer rebalance is a stop-the-world operation.
kafka.consumer:type=consumer-coordinator-metrics,name=
rebalance-rate-per-hour
rebalance-latency-avg
Client metrics: Custom
Where time is being spent.
Number of active consumers in a consumer group.
Idle and blocking threads.
End-to-end latency.
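End-to-end latency is one custom metric that is simple to compute in the application: subtract the time a record was produced from the time it was consumed, then track a percentile instead of the average. A sketch under assumptions: the timestamps are in milliseconds, the producer and consumer clocks are synchronised, and the function names and sample values are made up.

```python
import math

def end_to_end_latencies_ms(records):
    """records: iterable of (produce_time_ms, consume_time_ms) pairs."""
    return [consumed - produced for produced, consumed in records]

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical (produced, consumed) timestamps in milliseconds.
RECORDS = [(1000, 1012), (2000, 2007), (3000, 3250), (4000, 4009)]

latencies = end_to_end_latencies_ms(RECORDS)
print(percentile(latencies, 50), percentile(latencies, 99))
```

A high percentile exposes the occasional slow record that an average would hide.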
Summary
Monitoring can get you ahead of problems before they happen.
Start with a small set of key metrics.
Alert on carefully selected metrics and avoid bad practices.
It is important to monitor OS-level metrics.
Thank you
Emma Humber
Support Lead - IBM Event Streams
—
emma.humber@uk.ibm.com
Gantigmaa (Tina) Selenge
DevOps Engineer – IBM Event Streams
—
gselenge@uk.ibm.com
© Copyright IBM Corporation 2020. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of
any kind, express or implied. Any statement of direction represents IBM’s current intent, is subject to change or withdrawal, and represent only goals and objectives. IBM, the IBM logo, and
ibm.com are trademarks of IBM Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available at Copyright and trademark information.
