In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collection, delivery, and processing. In this presentation, Jun Rao, co-founder of Confluent, gives a deep dive into some of the key internals that make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
7. Detailed Log Representation
[Diagram: a partition's log is divided into segments by offset range (e.g., 0 - 10000, 10001 - 20000, 20001 - 30000); each segment has a sparse offset index and a timestamp index.]
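Conceptually, a lookup first binary-searches for the segment covering the target offset, then binary-searches that segment's sparse offset index for the nearest entry at or before the target, and scans the log forward from there. A minimal sketch in Python (the segment data and names are illustrative, not Kafka's actual on-disk format):

```python
import bisect

# Hypothetical segments: base offset -> sparse offset index,
# a sorted list of (relative_offset, file_position) pairs.
segments = {
    0:     [(0, 0), (4000, 52_000), (8000, 101_000)],
    10001: [(0, 0), (5000, 63_000)],
    20001: [(0, 0), (3000, 41_000), (9000, 120_000)],
}
base_offsets = sorted(segments)

def locate(target_offset):
    """Return (segment base offset, file position to start scanning from)."""
    # 1. Binary search for the segment whose base offset <= target.
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    base = base_offsets[i]
    index = segments[base]
    # 2. Binary search the sparse index for the nearest entry at or
    #    before the target; the log is scanned forward from there.
    rel = target_offset - base
    j = bisect.bisect_right([r for r, _ in index], rel) - 1
    return base, index[j][1]

print(locate(23500))  # -> (20001, 41000)
```

Because the index is sparse, it stays small enough to be memory-mapped while still bounding the final linear scan to one index interval.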
11. Kafka Replication
• Configurable replication factor
• Tolerating f – 1 failures with f replicas
• Unlike quorum-based replication, which needs 2f + 1 replicas to tolerate f failures
• Automated failover
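The contrast in the bullets above comes down to simple arithmetic: Kafka's primary-backup style replication can keep serving as long as one in-sync replica survives, while a majority quorum needs more than half of its replicas alive. An illustrative sketch (the function names are mine):

```python
def kafka_failures_tolerated(replicas: int) -> int:
    # Primary-backup: any single surviving in-sync replica can
    # take over as leader, so f replicas tolerate f - 1 failures.
    return replicas - 1

def quorum_failures_tolerated(replicas: int) -> int:
    # Majority quorum: a strict majority must survive, so
    # 2f + 1 replicas tolerate only f failures.
    return (replicas - 1) // 2

for n in (3, 5):
    print(n, kafka_failures_tolerated(n), quorum_failures_tolerated(n))
# With 3 replicas: Kafka-style tolerates 2 failures, a quorum only 1.
```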
13. High Level Data Flow in Replication
[Diagram: the producer writes to the leader replica of topic1-part1 on broker 1 (step 1); follower replicas of topic1-part1 on brokers 2 and 3 fetch from the leader (step 2); once all in-sync followers have the message, the leader commits it (step 3) and acks the producer (step 4); the consumer reads committed messages from the leader.]

When producer receives ack     Latency               Durability on failures
acks=0 (no ack)                no network delay      some data loss
acks=1 (wait for leader)       1 network roundtrip   a little data loss
acks=all (wait for committed)  2 network roundtrips  no data loss
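These trade-offs map directly to configuration. A hedged fragment showing where each knob lives (the property names are Kafka's; the values are illustrative):

```
# Producer config: when a send is considered successful
acks=all

# Broker/topic config: how many in-sync replicas a write must reach
# before the leader treats it as committed
min.insync.replicas=2
```

With acks=all alone, a write is committed once all current in-sync replicas have it; min.insync.replicas additionally rejects writes when too few replicas are in sync.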
32. Cleaning Configs
• log.cleaner.min.cleanable.ratio (default 0.5)
• dirty/total ratio at which the log cleaner is triggered
• log.cleaner.io.max.bytes.per.second (default infinite)
• Maximum rate at which cleaning can be done
• Can be used for throttling
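The dirty ratio in the first bullet is just the fraction of the log written since the last cleaning. A tiny illustrative check (the function name is mine, not Kafka's):

```python
def should_clean(dirty_bytes: int, total_bytes: int,
                 min_cleanable_ratio: float = 0.5) -> bool:
    # Trigger cleaning once the uncleaned ("dirty") portion of the
    # log reaches the configured fraction of the total log size.
    return total_bytes > 0 and dirty_bytes / total_bytes >= min_cleanable_ratio

print(should_clean(40, 100))  # False: only 40% of the log is dirty
print(should_clean(60, 100))  # True: 60% dirty, above the 0.5 default
```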
33. Be Careful with Deletes
• Delete tombstone modeled as null message
• Danger of removing a deleted key too soon
• Consumer still assumes the old value with the key
• log.cleaner.delete.retention.ms (default 1 day)
• “Delete tombstone” removed after that time
• Consumer needs to finish consuming the tombstone before that time
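The risk described above can be shown with a toy compaction pass: compaction keeps the latest value per key, a delete is a record with a null value (a tombstone), and the tombstone itself is only dropped after the retention window. This is an illustrative simulation, not Kafka's cleaner code:

```python
def compact(log, tombstone_expired=lambda key: False):
    """Keep the latest record per key; drop expired delete tombstones.

    log: list of (key, value) pairs, where value=None marks a delete.
    tombstone_expired: predicate standing in for the
    log.cleaner.delete.retention.ms check.
    """
    latest = {}
    for key, value in log:
        latest[key] = value  # later records shadow earlier ones
    return {
        key: value
        for key, value in latest.items()
        if value is not None or not tombstone_expired(key)
    }

log = [("k1", "v1"), ("k2", "v2"), ("k1", None), ("k2", "v3")]

# Before the tombstone expires, consumers still learn that k1 was deleted.
print(compact(log))                                    # {'k1': None, 'k2': 'v3'}
# After expiry the tombstone is gone; a consumer that never saw it
# would still assume k1 holds its old value.
print(compact(log, tombstone_expired=lambda k: True))  # {'k2': 'v3'}
```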
34. Summary
• Apache Kafka is a streaming platform
• The storage part supports
• High throughput
• High availability and durability
• Retaining database-like data
35. Coming Up Next
Date    Title                                                          Speaker
10/27   Data Integration with Kafka                                    Gwen Shapira
11/17   Demystifying Stream Processing                                 Neha Narkhede
12/1    A Practical Guide To Selecting A Stream Processing Technology  Michael Noll
12/15   Streaming in Practice: Putting Apache Kafka in Production      Roger Hoover
Editor's notes
It’s widely used and in production at thousands of companies.
Let’s walk through the basics of Kafka and understand how it acts as a streaming platform.
Now we’ve talked about these two motivations for streams: solving pipeline sprawl and asynchronous stream processing.
It won’t surprise anyone that when I talk about this streaming platform that enables these pipelines and processing I am talking about Apache Kafka.