Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session is about a common issue in the Kafka Producer: producer batch expiry. We will discuss the Kafka Producer internals, the common causes of batch expiry, such as a slow network or small batches, and how to overcome them. We will also share some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
Common issues with Apache Kafka® Producer
1. Common issues with Apache Kafka® Producer
Badai Aqrandista, Senior Technical Support Engineer
2. Introduction
• My name is BADAI AQRANDISTA
• I started as a web developer, building websites with Perl and PHP in 2005.
• Experience supporting applications in Linux/UNIX environments, including a hotel booking engine, a telecommunication billing system, and a mining equipment monitoring system.
• Currently working for Confluent as a Senior Technical Support Engineer.
3. Kafka in a nutshell
• Kafka is a Pub/Sub system
• A Kafka Producer sends records to a Kafka broker
• A Kafka Consumer fetches records from a Kafka broker
• A Kafka broker persists any data it receives until the retention period expires
[Diagram labels: PRODUCER, CONSUMER]
5. Kafka Producer Internals
• KafkaProducer API:
• public Future<RecordMetadata> send(ProducerRecord<K,V> record)
• public Future<RecordMetadata> send(ProducerRecord<K,V> record, Callback callback)
• The KafkaProducer#send method is asynchronous.
• It does not immediately send the record to the Kafka broker.
• It puts the record in an internal queue, and an internal thread sends multiple records together as a batch.
[Diagram: a Batch containing several Records, each with a Key and a Value]
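To make the asynchronous behaviour concrete, here is a minimal sketch in plain Java. It does not use the real KafkaProducer: `AsyncSendSketch`, `send`, and `sendBatch` are hypothetical names, and the offsets are made up. It only illustrates the idea that `send` queues the record and returns a pending future, and that the future completes later when the queued records go out as one batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the producer's internal queue; not the real KafkaProducer.
final class AsyncSendSketch {
    // A queued record together with the future handed back to the caller.
    private static final class Pending {
        final String key;
        final String value;
        final CompletableFuture<Long> result = new CompletableFuture<>();

        Pending(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Queue<Pending> queue = new ArrayDeque<>();
    private long nextOffset = 0;

    // Like send(): enqueue the record and return immediately with a pending future.
    CompletableFuture<Long> send(String key, String value) {
        Pending pending = new Pending(key, value);
        queue.add(pending);
        return pending.result;
    }

    // Plays the role of the internal thread: takes everything queued so far as one
    // batch, "sends" it, and completes each future with a made-up offset.
    // Returns the number of records in the batch.
    int sendBatch() {
        List<Pending> batch = new ArrayList<>(queue);
        queue.clear();
        for (Pending pending : batch) {
            pending.result.complete(nextOffset++);
        }
        return batch.size();
    }
}
```

In this sketch, a caller that needs the result would wait on the returned future, which is the same pattern as calling `get()` on the `Future<RecordMetadata>` returned by the real `send`.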
6. Kafka Producer Internals
• Each Kafka Producer batch corresponds to one partition.
• Kafka Producer determines the batch to append a record to based on the record key.
• If the record key is “null”, Kafka Producer will choose the batch randomly.
• If the record key is not “null”, Kafka Producer will use the hash of the record key to determine the partition number.
• One or more batches are sent to the Kafka broker in a PRODUCE request.
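The key-to-partition rule above can be sketched as follows. This is an illustration only: `PartitionSketch` and `partitionFor` are hypothetical names, and `Arrays.hashCode` is a stand-in hash (the real producer hashes the serialized key bytes with murmur2). The null-key branch simply follows the "random" description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; not part of the Kafka client library.
final class PartitionSketch {
    // Returns the partition for a record key, following the rules described above.
    static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            // No key: pick any partition (described above as a random choice).
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Stand-in hash. The real producer uses murmur2 on the serialized key bytes.
        int hash = Arrays.hashCode(keyBytes);
        // Clear the sign bit so the result is never negative, then map onto the partitions.
        return (hash & 0x7fffffff) % numPartitions;
    }
}
```

The property that matters is that the same key always maps to the same partition, as long as the number of partitions does not change.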
7. Kafka Producer Internals
• The Kafka Producer internal thread sends a batch to the Kafka broker based on these configurations:
• “batch.size” – defaults to 16384 bytes (16 kB)
• “linger.ms” – defaults to 0
• So, the Kafka Producer internal thread sends a batch to the Kafka broker when:
• The total size of records in the batch reaches “batch.size”, or
• The time since batch creation exceeds “linger.ms”, or
• The Kafka Producer “flush()” method is called (directly or indirectly via “close()”).
• Kafka Producer only creates one connection to each broker.
• In the end, every batch for a Kafka broker must be sent sequentially through this one connection.
• The maximum number of unacknowledged PRODUCE requests sent to each broker at any one time is controlled by “max.in.flight.requests.per.connection”, which defaults to 5.
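The three send conditions above can be written as a single predicate. This is a simplified sketch, not the producer's actual code: `BatchReadySketch` and `readyToSend` are hypothetical names, and the comparisons use "greater than or equal" as an assumption.

```java
// Hypothetical helper; a simplified model of when a batch becomes sendable.
final class BatchReadySketch {
    static boolean readyToSend(int batchBytes, long batchAgeMs, boolean flushRequested,
                               int batchSize, long lingerMs) {
        boolean full = batchBytes >= batchSize;      // "batch.size" reached
        boolean lingered = batchAgeMs >= lingerMs;   // "linger.ms" elapsed since batch creation
        return full || lingered || flushRequested;   // flush() or close() forces a send
    }
}
```

With the default “linger.ms” of 0, the second condition is true as soon as the batch exists, so a batch only grows when records arrive faster than the internal thread can send them. Raising “linger.ms” gives batches time to fill.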
9. Kafka Producer Issues
1. Failure to connect to Kafka broker
2. Record is too large
3. Batch expires before sending
4. Not enough replicas error
10. Failure to connect to Kafka broker
• The error message is not obvious, but it means the producer failed to connect to a Kafka broker.
• The error message looks like this:
• [2021-08-02 12:57:44,097] WARN [Producer clientId=producer-1] Connection to node -1 (kafka1/172.20.0.6:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
• How to fix this:
• Check the broker configuration to confirm the listener port and security protocol
• Check the hostname or the IP address of the broker
• Confirm that the Kafka Producer’s “bootstrap.servers” configuration is correct
• Confirm that connectivity exists between the Kafka Producer’s host and the Kafka broker hosts with commands such as:
• ping {BROKER_HOST}
• nc {BROKER_HOST} {BROKER_PORT}
• openssl s_client -connect {BROKER_HOST}:{BROKER_PORT}
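If it is easier to test from inside the JVM that runs the producer, the `nc` check above can be approximated with a plain TCP connect. This is a minimal sketch using only the Java standard library; `BrokerReachability` and `canConnect` are hypothetical names. It only shows that the port accepts TCP connections, and says nothing about TLS or authentication, which is what the `openssl s_client` check is for.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper; checks TCP reachability only, not TLS or SASL.
final class BrokerReachability {
    // Returns true if a TCP connection to host:port opens within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable hostnames.
            return false;
        }
    }
}
```

For the error message shown above, the call would be `BrokerReachability.canConnect("kafka1", 9093, 5000)`, run from the producer host.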
11. Record is too large
• This error occurs because the record size is greater than the “max.request.size” configuration, which defaults to 1048576 bytes (1 MB).
• The error message looks like this:
• org.apache.kafka.common.errors.RecordTooLargeException: The message is 1600088 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
• How to fix it:
• Reduce the record size. This requires a change in the application that generates the record.
• If you cannot reduce the record size, you can increase the producer configuration “max.request.size”. If you do this, you also need to increase the topic configuration “max.message.bytes”.
• Note: “max.request.size” is the maximum request size AFTER serialization but BEFORE compression. So, enabling compression will not fix this.
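As a rough illustration of the size check and the producer-side override, here is a sketch using only `java.util.Properties`. `RecordSizeSketch`, `tooLarge`, and `producerOverrides` are hypothetical names, and the 2 MB value in the usage note is an arbitrary example, not a recommendation. The real client computes the serialized size itself; this only mirrors the comparison reported in the error message.

```java
import java.util.Properties;

// Hypothetical helper; mirrors the comparison reported by RecordTooLargeException.
final class RecordSizeSketch {
    static final int DEFAULT_MAX_REQUEST_SIZE = 1048576; // 1 MB, the default quoted above

    // True if a record of this serialized size would be rejected by the producer.
    static boolean tooLarge(int serializedBytes, int maxRequestSize) {
        return serializedBytes > maxRequestSize;
    }

    // Producer properties with a raised "max.request.size".
    // The topic's "max.message.bytes" must be raised separately on the topic.
    static Properties producerOverrides(int maxRequestSize) {
        Properties props = new Properties();
        props.setProperty("max.request.size", Integer.toString(maxRequestSize));
        return props;
    }
}
```

With the numbers from the error above, `tooLarge(1600088, DEFAULT_MAX_REQUEST_SIZE)` is true, while `tooLarge(1600088, 2097152)` is false, so a producer configured with `producerOverrides(2097152)` would accept the record, provided the topic limit is raised as well.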
12. Batch expires before sending
• This error is a symptom of slow transfer time (on the network) or slow processing (on the Kafka broker).
• The error looks like this:
• org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test1-0:1500 ms has passed since batch creation
• Sanity checks:
• Is the topic partition online? A topic partition is online if one or more of the Kafka brokers hosting its replicas are online.
• Use “kafka-topics --bootstrap-server {BROKER HOST:PORT} --describe --topic {TOPIC NAME}”
• “delivery.timeout.ms” – An upper bound on the time to report success or failure after a call to send() returns.
• The default value is 120000 ms (2 minutes).
• If “delivery.timeout.ms” is set to a very low value, it can cause batches to expire too early.
• “batch.size” – The maximum size of a record batch.
• The default value is 16384 bytes (16 kB).
• If the message size is large, this configuration may need to be increased to allow more records per batch. More records per batch means higher throughput and lower latency per record.
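One way to catch a “delivery.timeout.ms” that is too low is to check it against the other producer timeouts before building the configuration. The sketch below relies on the rule from the producer configuration documentation that “delivery.timeout.ms” should be at least “linger.ms” plus “request.timeout.ms” (“request.timeout.ms” is not discussed above; it defaults to 30000 ms). `TimeoutConfigSketch` and `producerTimeouts` are hypothetical names.

```java
import java.util.Properties;

// Hypothetical helper; validates the timeout relationship before building the config.
final class TimeoutConfigSketch {
    static Properties producerTimeouts(long lingerMs, long requestTimeoutMs, long deliveryTimeoutMs) {
        // A batch needs time to linger and at least one full request attempt
        // before it can reasonably be declared expired.
        if (deliveryTimeoutMs < lingerMs + requestTimeoutMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms (" + deliveryTimeoutMs + ") should be >= linger.ms ("
                + lingerMs + ") + request.timeout.ms (" + requestTimeoutMs + ")");
        }
        Properties props = new Properties();
        props.setProperty("linger.ms", Long.toString(lingerMs));
        props.setProperty("request.timeout.ms", Long.toString(requestTimeoutMs));
        props.setProperty("delivery.timeout.ms", Long.toString(deliveryTimeoutMs));
        return props;
    }
}
```

For example, `producerTimeouts(0, 30000, 120000)` returns the default values, while `producerTimeouts(0, 30000, 1500)` is rejected because 1500 ms leaves no room for even one request attempt.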
13. Batch expires before sending
• How to investigate this issue (cont’d):
• First, we need to identify whether this is caused by slow transfer time or slow processing.
• To check if it is slow transfer time, execute “ping {BROKER HOST}” from the producer host. The round trip time (RTT) should be reasonable. For example: if both the producer and the Kafka brokers are in the same data center, the RTT should be less than 10 ms, and mostly under 1 ms.
• If the “ping” result is good (i.e. consistently under 10 ms with 0% packet loss), then network latency is unlikely to be the cause.
• To check if it is slow processing, check the following on the Kafka brokers:
• Number of connections on the Kafka broker with “netstat -n | grep 9092 | wc -l”. More than 1000 connections is usually too high and can cause slow processing or connectivity issues.
• Number of topic partitions per broker. More than 1500 partitions per broker is usually too high and can cause slow processing. Check it with “kafka-topics --bootstrap-server {BROKER HOST:PORT} --describe | awk '{print $5, $6}' | sort | uniq -c”.
• If the Kafka broker host has enough CPU and memory, then you can increase “num.replica.fetchers” to 2 or 3 to allow more partitions per broker.
• Inter-broker “ping” latency. If the brokers are running across multiple data centers (e.g. multiple Availability Zones), then this may be a significant contributor to produce latency.
• CPU usage of the Kafka brokers. The following JMX metrics also show how idle the internal threads are:
• kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent – if this is low (< 0.5), the broker needs a higher “num.io.threads”, if CPU allows.
• kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent – if this is low (< 0.5), the broker needs a higher “num.network.threads”, if CPU allows.
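The broker-side thresholds above can be collected into a small checklist function. This is only a sketch: `BrokerHealthSketch` and `findings` are hypothetical names, the inputs are assumed to be gathered separately (for example with the `netstat` and `kafka-topics` commands and the JMX metrics listed above), and the thresholds are the rules of thumb from this slide, not hard limits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; applies the rule-of-thumb thresholds from this slide.
final class BrokerHealthSketch {
    static List<String> findings(int connections, int partitions,
                                 double requestHandlerIdle, double networkProcessorIdle) {
        List<String> out = new ArrayList<>();
        if (connections > 1000) {
            out.add("too many connections");
        }
        if (partitions > 1500) {
            out.add("too many partitions");
        }
        if (requestHandlerIdle < 0.5) {
            out.add("consider raising num.io.threads");
        }
        if (networkProcessorIdle < 0.5) {
            out.add("consider raising num.network.threads");
        }
        return out;
    }
}
```

An empty list means none of the thresholds were crossed, which points back toward the network rather than the broker.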
14. Not enough replicas error
• This means the number of replicas in the ISR is less than the “min.insync.replicas” configuration.
• The error looks like this:
• [2021-08-03 01:34:05,077] WARN [Producer clientId=producer-1] Got error produce response with correlation id 3 on topic-partition test2-0, retrying (2147483646 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender)
• For example, this error occurs when all of the following are true:
• Topic replication factor is 3.
• Topic configuration includes “min.insync.replicas=2”.
• Producer uses “acks=all” configuration.
• Fewer than 2 replicas are currently in the ISR (for example, two of the three brokers are offline).
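The condition behind this error can be expressed as a one-line check. This is a sketch of the rule, not broker code: `MinIsrSketch` and `notEnoughReplicas` are hypothetical names. It treats “acks=-1” the same as “acks=all”, which is how the producer configuration defines it.

```java
// Hypothetical helper; models when a produce request is rejected with NOT_ENOUGH_REPLICAS.
final class MinIsrSketch {
    static boolean notEnoughReplicas(String acks, int isrSize, int minInsyncReplicas) {
        // "min.insync.replicas" is only enforced for producers using acks=all (or -1).
        boolean acksAll = "all".equals(acks) || "-1".equals(acks);
        return acksAll && isrSize < minInsyncReplicas;
    }
}
```

With a replication factor of 3 and “min.insync.replicas=2”, one offline broker still leaves an ISR of 2, so produce requests succeed. The error only appears once the ISR shrinks to 1.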
15. Not enough replicas error
• What is ISR? It is short for “In Sync Replicas”. This means the replicas that are in sync with the leader (the leader itself is counted). In other words, the replicas that have all the records that the leader replica has.
• How can a replica become out of sync? Either because the broker is offline, or because of a replication failure or slow replication.
• How to fix this error:
• If it is out of sync because a Kafka broker is offline, start the broker hosting the offline replicas.
• If it is out of sync because of a replication failure, fix the failure. This is a separate discussion, but the most common cause is disk failure. If the disk storing the replica data is full, the Kafka broker will stop replicating all replicas on that disk.
• If it is out of sync because of slow replication, fix the slow replication. This is also a separate discussion, but the most common cause is inter-broker latency or too many topic partitions per broker.