We use machine learning to delve into the internals of how systems like Kafka work. In this talk I'll cover the variables that affect performance and reliability, including previously unknown leading indicators of major performance problems and failure conditions, and how to tune for specific use cases. I'll walk through some of the specific methodology we use, including Bayesian optimization and reinforcement learning. I'll also talk about our own internal infrastructure, which makes heavy use of Kafka and Kubernetes to deliver real-time predictions to our customers.
2. Matthew Stump
Co-Founder & CEO
• Worked with 500+ large-scale deployments
• Author of the Cassandra C++ driver
• Contributor to k8s, Cassandra, Lucene, Hadoop
• Designed some of the largest distributed systems in existence
• Ran strategic pre-sales and product marketing at DataStax
• Founder and CTO at SourceNinja
3. Agenda
• Why another monitoring tool?
• What do we want as operators of systems like Kafka?
• How we used Kafka as the backbone of our product
• Our architecture, highlighting use of Kafka in Kubernetes
• How our ML models work and the types of models we use
• Example wins and use cases
  • Tuning Kafka for large message sizes > 10MB
  • Identifying replication groups in rebalancing storms
7. What do we want from our tools? To know…
• What changed?
• What's caused it?
• How do I make it stop?
• How do I tune the system for my application?
• What should I know but don't?
8. Why is this hard?
• Things often fail in coordination/cascades
• Disk, garbage collection, network, request latency, CPU
• One bad actor can take down entire distributed systems
• Existing systems look at 1 signal in a vacuum
• Bad/nonexistent documentation or tuning recommendations
• Information overload: too many knobs, too many metrics
• These systems are very complex; nobody knows how they really work
15. Types of ML Models
Unsupervised
• Clustering: has the workload changed?
• Clustering: stable vs. unstable
Supervised
• Binary LSTM: stable vs. unstable
• Binary LSTM: are you this bug?
• Multi-class LSTM: identify 1 of N bad behaviors
Reinforcement Learning
• Optimize a "score": tune for latency, cost, throughput
• Meta-learn best resolution
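To make the supervised column concrete, here is a minimal sketch, assuming PyTorch, of a binary LSTM that labels a window of broker metrics as stable or unstable. The metric count, window length, and hidden size are illustrative, not our production values:

import torch
import torch.nn as nn

class StabilityLSTM(nn.Module):
    """Binary classifier: does this window of metrics look stable or unstable?"""
    def __init__(self, n_metrics: int = 8, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_metrics, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)           # single logit: 1 = unstable

    def forward(self, x):                          # x: (batch, time_steps, n_metrics)
        _, (h_n, _) = self.lstm(x)                 # h_n: (1, batch, hidden)
        return self.head(h_n[-1])                  # (batch, 1) logits

model = StabilityLSTM()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake training batch: 16 windows of 60 samples x 8 metrics
# (e.g. ISR size, GC pause, request latency, CPU, disk, network).
x = torch.randn(16, 60, 8)
y = torch.randint(0, 2, (16, 1)).float()           # 1 = unstable, 0 = stable

opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()

The same window-of-metrics input shape carries over to the multi-class case by widening the output head to N logits.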
16. Why Kafka?
• All of our workloads are inherently asynchronous
• Central communications bus between all of our workers
• Wildly different hardware requirements: GPU, CPU, memory
• Wildly different latencies for different workers
• Monte Carlo simulation > 60 seconds
• Clustering ≈ 2-3 milliseconds
• Decouple releases, test new ideas in low-risk ways
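A minimal sketch of the worker pattern described above, assuming the kafka-python client; the topic names, group id, and model call are illustrative. Each worker type consumes requests from its own topic and publishes results back onto the bus, so slow workers (Monte Carlo) and fast workers (clustering) stay decoupled:

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "clustering-requests",                         # illustrative topic name
    bootstrap_servers="kafka:9092",
    group_id="clustering-workers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def run_model(payload):
    # Placeholder for the actual model call (clustering takes ~2-3 ms in our case).
    return {"request_id": payload["request_id"], "label": "stable"}

for message in consumer:
    result = run_model(message.value)
    producer.send("clustering-results", result)    # results go back onto the bus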
17. Where Kafka Broke
• Poor tooling outside of the JVM ecosystem
• Defaults work for many people, but there's no documentation for tuning and troubleshooting outside of a couple of common scenarios
• Cross-network communication is problematic (multiple k8s clusters)
• Single bad actor can take down large clusters
19. Example One: Kafka & large messages
*Jiangjie Qin, LinkedIn
• Most benchmarks stop at < 1MB message sizes
• No official guidelines for tuning for large messages
• No official guidelines for tuning thread pools or memory
• Most people say don't use large messages, and recommend building a hybrid system
20. Example One: Kafka & large messages
java.io.IOException: Connection to 1 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
…
Presentation
Replicas falling behind (ISR), large messages, connection timeouts
Results
Can safely handle messages > 30 MB at scale with a small cluster
num.io.threads ≈ 2 × CPU cores
num.network.threads ≈ 2 × CPU cores
num.replica.fetchers ≈ replica count of the largest topic + 20-50%
replica.fetch.max.bytes ≈ 150% of the max message size
Solution
Increase replication, I/O, and network thread counts. Increase the replica fetch size. Specific values vary.
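As an illustration only, here is a small Python sketch that turns the heuristics above into concrete broker settings. The 30 MB message size and 3x replication are assumed inputs, message.max.bytes is added because the broker must also accept messages this large, and real values should be validated per cluster:

import os

cpu_cores = os.cpu_count() or 16                   # assumed broker CPU count
max_message_bytes = 30 * 1024 * 1024               # largest expected message (~30 MB)
largest_topic_replicas = 3                         # replica count of the largest topic

settings = {
    "num.io.threads": 2 * cpu_cores,
    "num.network.threads": 2 * cpu_cores,
    # replica count of the largest topic plus roughly 20-50% headroom
    "num.replica.fetchers": int(largest_topic_replicas * 1.5),
    # ~150% of the max message size so replication fetches never stall
    "replica.fetch.max.bytes": int(max_message_bytes * 1.5),
    # not on the slide, but the broker also has to accept messages this large
    "message.max.bytes": max_message_bytes,
}

for key, value in sorted(settings.items()):
    print(f"{key}={value}")                        # paste into server.properties overrides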
21. Example Two: Rebalance storms
Presentation
Periodic leader election timeouts, consumer lag, node timeouts
• A single bad consumer can take down a large Kafka cluster
• Leader election timeouts; unresponsive Kafka nodes fail health checks
• Default rolling restart behavior in k8s is capable of triggering a cascading failure
• Large messages or slow consumers increase the likelihood of instability
Solution
• Avoid slow, asynchronous libraries (aiokafka)
• Change k8s deployment policy to podManagementPolicy: "Parallel"
• Decrease batch sizes
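A minimal sketch of the consumer-side piece of that fix, assuming the synchronous kafka-python client (rather than aiokafka); the topic, group id, and specific numbers are illustrative. Smaller per-poll batches keep a slow consumer from holding the group long enough to trigger a rebalance:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "predictions",                                 # illustrative topic
    bootstrap_servers="kafka:9092",
    group_id="prediction-workers",
    max_poll_records=50,                           # fewer records per poll, shorter gaps between polls
    max_partition_fetch_bytes=5 * 1024 * 1024,     # cap fetch size when messages are large
    max_poll_interval_ms=300_000,                  # give slow workers time before eviction
    session_timeout_ms=30_000,
)

def handle(msg):
    # Placeholder: real workers run the model and publish results back to Kafka.
    print(msg.topic, msg.partition, msg.offset)

for message in consumer:
    handle(message)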
22. We're expanding to many systems; here are a few:
Come talk to us; we want to help.