Apache Kafka + Zookeeper = 3.5 million writes per second, as presented at http://www.meetup.com/hyderabad-scalability/events/220582368/ by Ranganathan B. This deck is a basic overview of Kafka.
2. Trend of software development
Src: http://eil.stanford.edu/publications/david_liu/david_dissertation.pdf
3. Pre-Kafka days in LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
4. Pre-Kafka days in LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Any problems? How would you solve them?
6. Kafka is
● publish-subscribe messaging service
● distributed commit/write-ahead log
● decouples data pipelines
● ordered per partition
Producers produce and consumers consume, in a large, distributed way.
7. Kafka characteristics
● fast - O(1)
● high throughput - 3.25 million writes per second
● scalable - (300+ brokers at LinkedIn)
● durable
● distributed
● replicated (fault tolerance)
8. Pre-Kafka days in LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
9. Introducing Kafka at LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
10. Why not RabbitMQ/ActiveMQ/….?
● Existing queues - http://queues.io/
● For highly distributed messaging, Kafka stands out.
● Consumed messages are ordered per partition.
● Good range of language APIs and integration APIs.
● Fast reads (efficient use of the OS page cache) and fast writes (efficient transfer from page cache to network sockets - the zero-copy optimization).
12. Timeline
● Originally developed at LinkedIn
● Open sourced in 2011, as version 0.6
● Graduated from the Apache Incubator - Oct 2012
● Written in Scala
● Latest stable - 0.8.2.0
13. Messaging Terminology
● A unit of data is called a message.
● Producers publish messages.
● Messages are stored in topics.
● Topics are partitioned and replicated across Kafka servers.
● Each Kafka server in a cluster is called a broker.
● Consumers consume messages from brokers.
14. Producers send messages over network to
Kafka cluster which in turn serves consumers.
[Diagram: multiple producers → brokers in the Kafka cluster → multiple consumers; all connections over TCP]
15. Topics
….
Messages are flushed to disk / removed based on
● number of messages: log.flush.interval
● time: log.default.flush.interval.ms, topic.flush.intervals.ms
● size: log.retention.size
….
16. Partition
…. ….
Partition logic can be
● default (kafka.producer.DefaultPartitioner - based on a hash of the key)
● custom (as required, e.g. by user-id, when downstream processing is keyed on user-id)
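The partitioning idea above can be sketched as "hash the key, take it modulo the partition count" (a minimal illustration; the CRC hash and function names here are assumptions, not Kafka's exact algorithm):

```python
import zlib

def choose_partition(key, num_partitions):
    # All messages with the same key hash to the same partition,
    # which is what preserves per-key ordering.
    # zlib.crc32 is used here for a stable, deterministic hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def user_id_partition(user_id, num_partitions):
    # A "custom" partitioner is just a different mapping, e.g. routing
    # by user-id so all of one user's events land on one partition.
    return user_id % num_partitions
```

Because the mapping is deterministic, repeated sends with the same key always land on the same partition.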
17. Partition
● Ordered, immutable sequence of messages
● Each message is assigned a unique offset
● Serves
o Horizontal scaling
o Parallel consumer reads (with consumption ordered per partition)
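The partition model above can be sketched as a tiny append-only log where each appended message receives the next offset (a minimal illustration, not Kafka's storage format):

```python
class Partition:
    """An ordered, append-only sequence of messages with unique offsets."""

    def __init__(self):
        self._log = []  # messages are never modified once appended

    def append(self, message):
        self._log.append(message)
        return len(self._log) - 1  # the new message's offset

    def read(self, offset):
        return self._log[offset]

part = Partition()
offsets = [part.append(m) for m in ("m0", "m1", "m2")]
```

Offsets are dense and increasing, so a reader can address any message by position.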
19. Producers - push
● batching
● compression
● sync (acked), async (auto-batch - say 60k messages or 10 ms)
● sequential writes - guaranteed order per partition
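The size-or-time batching idea above can be sketched as a buffer that is flushed as one "network send" when either threshold is hit (a toy sketch; the class, thresholds, and method names are made up, not Kafka's producer API):

```python
import time

class BatchingProducer:
    def __init__(self, batch_size=3, linger_ms=60_000):
        self.batch_size = batch_size
        self.linger_ms = linger_ms      # max time a message waits in the buffer
        self.buffer = []
        self.sent_batches = []          # each entry = one batched network send
        self._last_flush = time.monotonic()

    def send(self, message):
        self.buffer.append(message)
        waited_ms = (time.monotonic() - self._last_flush) * 1000
        # Flush when the batch fills up or the linger window expires.
        if len(self.buffer) >= self.batch_size or waited_ms >= self.linger_ms:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sent_batches.append(list(self.buffer))
            self.buffer.clear()
        self._last_flush = time.monotonic()

prod = BatchingProducer(batch_size=3)
for m in ["a", "b", "c", "d"]:
    prod.send(m)  # "a".."c" fill and flush one batch; "d" waits in the buffer
```

Batching trades a little latency (the linger window) for far fewer network round trips.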
20. Consumers - pull
● Queue of consumers (consumer group)
● Position is based on the offset, controlled by the consumer and persisted at intervals into the __consumer_offsets topic
● Can rewind the offset
● Guaranteed order per partition
● More partitions enable better parallel reads
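Since the position is just an offset owned by the consumer, rewinding is resetting a number. A minimal sketch of pull-based consumption (illustrative; a real consumer would also persist the offset, e.g. to __consumer_offsets):

```python
class Consumer:
    def __init__(self, partition_log):
        self.log = partition_log
        self.offset = 0                  # consumer-controlled position

    def poll(self, max_messages=10):
        # Pull the next chunk of messages starting at the current offset.
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)        # advance past what was read
        return batch

    def seek(self, offset):
        self.offset = offset             # rewind (or skip ahead)

consumer = Consumer(["m0", "m1", "m2"])
first = consumer.poll(2)
consumer.seek(0)                         # rewind and replay from the start
replayed = consumer.poll(3)
```

Because the broker does not track delivery per message, replay costs the consumer nothing more than a seek.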
21. Message delivery guarantees
● At most once - Read -> Save position -> Process
● At least once - Read -> Process -> Save position
● Exactly once - Write output and position to the same place.
With Kafka:
● At least once - (default)
● At most once - (disable producer retries and save the offset before processing)
● Exactly once - store the offset at the destination system
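The first two orderings above can be simulated to show why the order of "save position" vs "process" decides the guarantee: a crash between the two steps either loses the message or processes it twice (hypothetical helper names, for illustration only):

```python
def at_most_once(log, state, crash=False):
    msg = log[state["offset"]]
    state["offset"] += 1                # save position first
    if crash:
        return                          # crash before processing: msg lost
    state["processed"].append(msg)

def at_least_once(log, state, crash=False):
    msg = log[state["offset"]]
    state["processed"].append(msg)      # process first
    if crash:
        return                          # crash before saving: msg re-read
    state["offset"] += 1

log = ["m0"]
lost = {"offset": 0, "processed": []}
at_most_once(log, lost, crash=True)     # a restart resumes at offset 1

dup = {"offset": 0, "processed": []}
at_least_once(log, dup, crash=True)     # offset is still 0 after the crash
at_least_once(log, dup)                 # restart re-reads the same message
```

At most once loses "m0" entirely; at least once processes it twice but never drops it.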
Event Sourcing ensures that all changes to application state are stored as a sequence of events. Not only can we query these events; we can also use the event log to reconstruct past states, and as a foundation to automatically adjust the state to cope with retroactive changes.
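Reconstructing state from the log can be sketched as a fold (replay) over the event sequence, where any past state is a replay of a prefix (the account example and event names are illustrative, not from the talk):

```python
def apply_event(balance, event):
    # Each event describes a change; applying it moves the state forward.
    kind, amount = event
    return balance + amount if kind == "deposit" else balance - amount

def replay(events, upto=None):
    # Current state = fold over all events; past state = fold over a prefix.
    balance = 0
    for event in (events if upto is None else events[:upto]):
        balance = apply_event(balance, event)
    return balance

events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
current = replay(events)         # state now
past = replay(events, upto=2)    # state as of the second event
```

This is why a durable, ordered log like a Kafka partition is a natural substrate for event sourcing: the log itself is the source of truth, and derived state is always recomputable.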