Troubleshooting Kafka’s socket server
from incident to resolution
Joel Koshy
LinkedIn
The incident
Occurred a few days after upgrading to pick up quotas and SSL
[Timeline: April 5 → June 3 → August 18 → October 13; between x25 and x38 the brokers picked up multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches]
The incident
A broker (which happened to be the controller) failed in our queuing Kafka cluster
The incident
● Alerts fire; NOC engages SYSOPS/Kafka-SRE
● Kafka-SRE restarts the broker
● Broker failure does not generally cause prolonged application impact
○ but in this incident…
The incident
Multiple applications begin to report “issues”: socket timeouts to Kafka cluster
Posts search was one such impacted application
The incident
Two brokers report high request and response queue sizes
The incident
Two brokers report high request queue size and request latencies
The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to roughly half of normal
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs
Remediation
Shifted site traffic to another data center
“Kafka outage ⇒ member impact
Multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
# accept inter-broker traffic (one ACCEPT per peer broker), then drop everything else, i.e., clients
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
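The rules were later dropped again (which is when the cluster fell over once more); one way to do that is with -D, which deletes the first matching rule (sketch):
sudo iptables -D INPUT -p tcp --dport <broker-port> -j DROP
sudo iptables -D INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT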
“Good dashboards/alerts
Skilled operators
Clear communication
Audit/log all operations
CRM principles apply to operations
Remediation
Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback
[Animated diagram: rolling downgrade of the x38 brokers, one at a time: move leaders off the broker, firewall it, downgrade it to x25, then repeat]
… oh and BTW
be careful when saving/copying a lot of public-access or server logs:
● Can cause long GC pauses, e.g. [Times: user=0.39 sys=0.01, real=8.01 secs]: low user/sys but ~8 s of wall-clock time, i.e., the pause was dominated by waiting on a contended disk/page cache rather than by GC work
● Use ionice or rsync --bwlimit
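For example, a minimal sketch of a throttled copy; the paths, host, and bandwidth cap are illustrative placeholders:
ionice -c3 rsync -av --bwlimit=10000 /path/to/server-logs/ <backup-host>:/path/to/dest/   # idle I/O class, ~10 MB/s cap so the copy doesn't starve the broker's disks/page cache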
The investigation
Attempts at reproducing the issue
● Test cluster
○ Tried killing controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
Understanding queue backups…
Life-cycle of a Kafka request
[Diagram: network layer (Acceptor, Processors, per-processor response queues, client connections) feeding a shared request queue into the API layer (API handlers, purgatory, quota manager)]
● Acceptor hands new client connections to a Processor
● Processor reads the request, then turns off read interest from that connection (for ordering)
● Await handling in the request queue: total time starts as queue-time
● API handler handles the request: + local-time
● Long-poll requests wait in purgatory: + remote-time
● Hold if quota violated: + quota-time
● Await a processor in the response queue: + response-queue-time
● Processor writes the response: + response-send-time, then turns read interest back on for that connection
Investigating high request times
● Total time is useful for monitoring
● but high total time is not necessarily bad
[Graphs: one cluster’s total time is low; another’s is high but “normal” (requests parked in purgatory)]
Investigating high request times
● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes
High local times during incident (e.g., fetch)
How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client
○ Throttle the request on quota violation
Could these cause high local times?
● Get physical offsets to be read from local log during response → should be fast
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught up then expand ISR (ZooKeeper write) → maybe
○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client → should be fast
○ Throttle the request on quota violation → delayed outside API thread
ISR churn? … unlikely
● Low ZooKeeper write latencies
● Churn in this incident: an effect of some other root cause
● Long request queues can cause churn
○ ⇒ follower fetches time out
■ ⇒ followers fall out of ISR (ISR shrink happens asynchronously in a separate thread)
○ Outstanding fetch catches up and ISR expands
High local times during incident (e.g., fetch)
Besides, fetch-consumer (not just follower) has high local time
Could these cause high local times?
● Get physical offsets to be read from local log during response → should be fast
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught up then expand ISR (ZooKeeper write) → should be fast (and churn was ruled out above)
○ Satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client → test this…
○ Throttle the request on quota violation → delayed outside API thread
Quota metrics
Maintains byte-rate metrics on a per-client-id basis:
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>; totalTime:6589, requestQueueTime:6589, localTime:0, remoteTime:0, responseQueueTime:0, sendTime:0, securityProtocol:PLAINTEXT, principal:ANONYMOUS
??!
Quota metrics - a quick benchmark
for (clientId <- 0 until N) {    // one sensor per distinct client-id
  timer.time {                   // measure each call
    quotaMetrics.recordAndMaybeThrottle(sensorId, 0, DefaultCallBack)   // sensorId derived from clientId; record 0 bytes
  }
}
Quota metrics - a quick benchmark
[Benchmark plots omitted: per-call latency climbs sharply as the number of distinct client-id sensors grows]
Fixed in KAFKA-2664
meanwhile in our queuing cluster…
[Graph annotation: due to climbing client-id counts]
● Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before experiment (sketch below)
○ Hooked up profiler to verify during incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see it in the earlier rolling bounce because only a few client-id metrics existed at the time
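A rough sketch of the kind of jmxterm one-liner used to eyeball the per-client-id metric count; the jar name, JMX port, and bean pattern are assumptions to verify against your broker version:
echo "beans -d kafka.server" \
  | java -jar jmxterm-uber.jar -l <broker-host>:<jmx-port> -n \
  | grep -c "client-id="    # count of per-client-id mbeans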
Troubleshooting: macro vs micro
MACRO (generally more effective)
● Observe week-over-week trends
● Formulate theories
● Test theory (micro-experiments)
● Deploy fix and validate
MICRO (live debugging; sometimes warranted, but invasive and tedious)
● Instrumentation
● Attach profilers
● Take heapdumps
● Trace-level logs, tcpdump, etc.
How to fix high local times
● Optimize the request’s handling, e.g.:
○ cache topic metadata instead of reading it from ZooKeeper (see KAFKA-901 and KAFKA-1356)
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if response depends on some condition); but be
aware of the caveats:
○ Higher memory pressure if request purgatory size grows
○ Expired requests are handled in purgatory expiration thread (which is good)
○ but satisfied requests are handled in API thread of satisfying request ⇒ if a request satisfies
several delayed requests then local time can increase for the satisfying request
as for rogue clients…
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0;
CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0
ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,
requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,
securityProtocol:PLAINTEXT,principal:ANONYMOUS
“Get apps to use wrapper libraries that implement good client behavior, shield from API changes and so on…
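As a sketch of what “good client behavior” means here (Java consumer property names, values illustrative), contrast with the MaxWait: 0 / MinBytes: 0 busy-polling and the random UUID ClientId in the log line above:
client.id=<stable-app-name>   # a stable, meaningful id rather than a fresh UUID per instance
fetch.min.bytes=1
fetch.max.wait.ms=500         # long-poll on the broker instead of busy-looping it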
not done yet!
this lesson needs repeating...
After deploying the metrics fix to some clusters…
[Chart: under-replicated partitions climb right after the deployment and stay persistently high (persistent URP)]
Applications also begin to report higher than usual consumer lag
Root cause: zero-copy broke for plaintext
[Chart annotations: upgraded cluster → rolled back → with fix (KAFKA-2517)]
The lesson...
● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
Monitor these closely!
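For reference, these correspond to broker JMX metrics roughly as follows (names from 0.8/0.9-era brokers; treat them as assumptions and verify for your version):
kafka.network:type=RequestChannel,name=RequestQueueSize
kafka.network:type=RequestChannel,name=ResponseQueueSize              # per processor
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce    # one per request type
kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent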
Continuous validation on trunk
Any other high latency requests?
Image courtesy of ©Nevit Dilmen
Local times
[Chart: local times per request type: ConsumerMetadata, ControlledShutdown, Fetch, LeaderAndIsr, OffsetCommit, OffsetFetch, Offsets (by time), Produce, StopReplica (for delete=true), TopicMetadata, UpdateMetadata]
ControlledShutdown, LeaderAndIsr, StopReplica, and UpdateMetadata are (typically 1:N) broker-to-broker requests
Broker-to-broker request latencies - less critical
[Diagram: network layer (Acceptor, Processor, response queue) → request queue → API layer (API handler, purgatory)]
● Read bit is off while the request is being handled, so a slow broker-to-broker request ties up at most one API handler
● But if the requesting broker times out and retries, each retry ties up another API handler
● So configure socket timeouts >> MAX(latency)
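A hedged server.properties sketch (property names as in 0.8/0.9-era brokers; values illustrative, and they should sit well above the worst observed request latency):
controller.socket.timeout.ms=30000   # controller → broker requests (LeaderAndIsr, StopReplica, UpdateMetadata)
replica.socket.timeout.ms=30000      # follower fetch connections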
● Broker to controller
○ ControlledShutdown (KAFKA-1342)
● Controller to broker
○ StopReplica[delete = true] should be asynchronous (KAFKA-1911)
○ LeaderAndIsr: batch request - maybe worth optimizing or putting in a purgatory? Haven’t
looked closely yet…
Broker-to-broker request latency improvements
The end
