2. What is Kafka?
Kafka is a distributed publish-subscribe messaging system rethought as a distributed
commit log.
It’s designed to be
Fast
Scalable
Durable
When used in the right way and for the right use case, Kafka has unique attributes
that make it a highly attractive option for data integration.
3. Publish subscribe messaging system
Kafka maintains feeds of messages in categories called topics
Producers are processes that publish messages to one or more topics
Consumers are processes that subscribe to topics and process the feed of
published messages
[Diagram: a publisher sends messages to a topic; multiple subscribers each receive the feed of messages]
4. Kafka cluster
Since Kafka is distributed in nature, Kafka is run as a cluster.
A cluster typically comprises multiple servers, each of which is called a broker.
Communication between the clients and the servers takes place over the TCP protocol.
[Diagram: producers and consumers connect to a Kafka cluster of brokers (Broker 1, Broker 2), each hosting partitions of Topic 1 and Topic 2, with ZooKeeper coordinating the cluster]
6. Topic
To balance load, a topic is divided into
multiple partitions and replicated
across brokers.
Partitions are ordered, immutable
sequences of messages that are
continually appended to, i.e. a commit log.
Each message in a partition is assigned
a sequential id number called the offset,
which uniquely identifies the message
within the partition.
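The commit-log behaviour described above can be sketched as a toy data structure (a minimal illustration only; this is not Kafka's implementation, and all names are made up):

```python
class Partition:
    """A toy append-only commit log: each message gets the next
    sequential offset, and messages are never modified in place."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        offset = len(self.messages)  # next sequential id
        self.messages.append(message)
        return offset

    def read(self, offset):
        # a consumer can re-read any retained offset
        return self.messages[offset]


p = Partition()
assert p.append("a") == 0
assert p.append("b") == 1
assert p.read(1) == "b"
```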
7. Distribution and partitions
Partitions allow a topic's log to scale beyond a size that will fit on a single server (i.e. a
broker) and act as the unit of parallelism.
The partitions of a topic are distributed over the brokers in the Kafka cluster, where each
broker handles data and requests for a share of the partitions.
For fault tolerance, each partition is replicated across a configurable number of brokers.
8. Distribution and fault tolerance
Each partition has one server which acts as the "leader" and zero or more servers
which act as "followers".
The leader handles all read and write requests for the partition while the followers
passively replicate the leader.
If the leader fails, one of the followers will automatically become the new leader.
Each server acts as a leader for some of its partitions and as a follower for others, so load
is well balanced within the cluster.
9. Retention
The Kafka cluster retains all published messages, whether or not they have been
consumed, for a configurable period of time, after which they are discarded to
free up space.
The metadata retained on a per-consumer basis is the position of the consumer in the
log, called the offset, which is controlled by the consumer.
Normally a consumer will advance its offset linearly as it reads messages, but it can
consume messages in any order it likes.
Kafka consumers can come and go without much impact on the cluster or on other
consumers.
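Time-based retention can be sketched as a pure function over a toy log (an illustration of the policy, not Kafka's segment-based implementation; the names are invented):

```python
import time


def discard_expired(log, retention_seconds, now=None):
    """Keep only messages still inside the retention window.
    `log` is a list of (timestamp, message) pairs, oldest first."""
    now = time.time() if now is None else now
    cutoff = now - retention_seconds
    return [(ts, msg) for ts, msg in log if ts >= cutoff]


log = [(100, "old"), (250, "recent")]
# with a 100-second window evaluated at t=300, only messages
# published at t >= 200 survive
assert discard_expired(log, 100, now=300) == [(250, "recent")]
```

Note that expiry is independent of whether anyone has consumed the messages, matching the slide above.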
10. Producers
Producers publish data to topics by assigning each message to a partition within the
topic, either in a round-robin fashion or according to some semantic partitioning function
(say based on some key in the message).
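Both strategies above can be sketched in a few lines (a toy partitioner; Kafka's real one uses murmur2 hashing, and every name here is made up):

```python
from itertools import count


def make_partitioner(num_partitions):
    """Pick a partition for each message: hash the key if one is
    given, otherwise fall back to round-robin."""
    rr = count()

    def partition(key=None):
        if key is not None:
            # stable toy hash so the same key always lands on the
            # same partition (deliberately not Python's hash())
            return sum(key.encode()) % num_partitions
        return next(rr) % num_partitions

    return partition


part = make_partitioner(3)
assert part("user-42") == part("user-42")            # same key -> same partition
assert [part(), part(), part(), part()] == [0, 1, 2, 0]  # round-robin
```

Keyed partitioning is what gives per-key ordering, since all messages with one key land in one partition.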
11. Consumers
Kafka offers a single consumer abstraction, the consumer group, that generalises
both the queue and the topic (publish-subscribe) models.
Consumers label themselves with a consumer group name.
Each message published to a topic is delivered to one consumer instance within each
subscribing consumer group.
If all the consumer instances are in the same consumer group, this works just like
a traditional queue, balancing load over the consumers.
If all the consumer instances are in different consumer groups, this works like
publish-subscribe, and all messages are broadcast to all consumers.
12. Consumer groups
Topics have a small number of consumer groups, one for each logical subscriber.
Each group is composed of many consumer instances for scalability and fault tolerance.
13. Ordering guarantees
Kafka assigns partitions in a topic to consumers in a consumer group so, each partition is
consumed by exactly one consumer in the group.
Limitation: there cannot be more consumer instances in a consumer group than partitions.
Provides a total order over messages within a partition, not between different partitions in
a topic.
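The assignment rule above, including its limitation, can be sketched like this (a simplified illustration; Kafka's range and round-robin assignors are more involved, and the function name is invented):

```python
def assign_partitions(partitions, consumers):
    """Give each partition exactly one owner within the group,
    spreading partitions round-robin over the group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


a = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
assert a == {"c1": [0, 2], "c2": [1, 3]}

# more consumers than partitions: the extras receive nothing,
# which is the limitation stated on the slide
b = assign_partitions([0], ["c1", "c2"])
assert b == {"c1": [0], "c2": []}
```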
14. Comparison
Kafka | JMS message broker; RabbitMQ
A fire hose of events arriving at a rate of approximately 100k+/sec | Messages arriving at a rate of 20k+/sec
'At least once' processing, as data is read using an offset within a partition | 'Exactly once' processing by consumers
Producer-centric: has no message acknowledgements, as consumers track the messages they have consumed | Broker-centric: uses the broker itself to maintain the state of what has been consumed (via message acknowledgements)
Supports both online and offline (batch) consumers, as well as producer message batching; designed for holding and distributing large volumes of messages at very low latency | Consumers are mostly online, and any messages "in wait" (persistent or not) are held opaquely
15. Comparison
Kafka | JMS message broker; RabbitMQ
Provides rudimentary routing: topics serve as exchanges | Provides rich routing capabilities with the Advanced Message Queuing Protocol's (AMQP) exchange, binding and queuing model
Makes the distributed cluster explicit, by forcing the producer to know it is partitioning a topic's messages across several nodes | Makes the distributed cluster transparent, as if it were a virtual broker
Preserves ordered delivery within a partition | Delivery is almost always unordered; the AMQP model requires "one producer channel, one exchange, one queue, one consumer channel" for in-order delivery
16. Throttling is unnecessary
The whole job of Kafka is to provide a "shock absorber" between the flood of
events and those who want to consume them in their own way.
17. Performance benchmark
500,000 messages published per second
22,000 messages consumed per second
on a 2-node cluster
with 6-disk RAID 10.
See research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
18. Key benefits
Horizontally scalable
It's a distributed system that can be elastically and transparently expanded with no downtime
High throughput
High throughput is provided for both publishing and subscribing, due to disk structures
that provide constant performance even with many terabytes of stored messages
Reliable delivery
Persists messages on disk, and provides intra-cluster replication
Supports a large number of subscribers and automatically rebalances consumers in case of
failure.
19. Use cases
Common use cases include
1. Stream processing, event sourcing, or a replacement for a more traditional message
broker
2. Website activity tracking - original use case for Kafka
3. Metrics collection and monitoring - centralized feeds of operational data
4. Log aggregation
21. Download and extract Kafka
Download the archive from
kafka.apache.org/downloads.html and
extract it
22. Kafka uses ZooKeeper for cluster coordination
Kafka uses ZooKeeper, which enables
highly reliable distributed coordination, so
one first needs to start a ZooKeeper server.
Kafka bundles a single-node ZooKeeper
instance; a single-node ZooKeeper cluster
does NOT run a leader and a follower.
Typically exchanged metadata include
Kafka broker addresses
Consumed message offsets
23. Common challenges faced by distributed systems
Outages
Co-ordination of tasks
Reduction of operational complexity
Consistency and ordering guarantees
ZooKeeper to the rescue
24. Apache Zookeeper: Definition
Centralised service for
Maintaining configuration information
Naming
Distributed synchronisation
Providing group services
25. Apache Zookeeper: Features
Distributed consistent data store which favours consistency over everything else.
High availability - Tolerates a minority of ensemble members being unavailable and
continues to function correctly.
In an ensemble of n members where n is an odd number, the loss of (n-1)/2 members can be
tolerated.
High performance - All the data is stored in memory and benchmarked at 50k ops/sec but
the numbers really depend on your servers and network
Tuned for a read-heavy, write-light workload. Reads are served from the node to which a client is
connected.
Provides strictly ordered access to data.
Writes are atomic and applied in the order they are sent to ZooKeeper.
Writes are acknowledged, and changes become visible, in the order they occurred.
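The majority-quorum arithmetic behind the availability claim above is simple enough to state directly (the function name is invented for illustration):

```python
def tolerable_failures(ensemble_size):
    """Members that can fail while a majority quorum survives:
    floor((n - 1) / 2)."""
    return (ensemble_size - 1) // 2


assert tolerable_failures(3) == 1
assert tolerable_failures(5) == 2
# an even-sized ensemble tolerates no more failures than the next
# odd size down, which is why odd ensemble sizes are recommended
assert tolerable_failures(4) == tolerable_failures(3)
```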
26. Apache Zookeeper: Operation basics
A cluster is an ensemble with one leader and several followers.
Read requests are serviced by each server from its local replica, but write requests are forwarded to the
leader. When the leader receives a write request, it calculates what the state of the system will be when the
write is applied and transforms this into a transaction that captures the new state.
When ZooKeeper starts, it goes through a leader-election algorithm whereby one node of the cluster is
elected to act as the leader. At any given point in time, only one node acts as the leader.
27. Apache Zookeeper: Operation basics
Clients create a stateful session (i.e. with heartbeats through an open socket) when they
connect to a node of an ensemble. The number of open sockets available on a ZooKeeper node
limits the number of clients that can connect to it.
When a cluster member dies, its clients notice a disconnect event and reconnect themselves
to another member of the quorum.
Sessions (i.e. the state of a client connected to a node of an ensemble) stay alive when that
node goes down, as session events go through the leader and get replicated across the cluster
onto other nodes.
When the leader goes down, the remaining members of the cluster re-elect a new leader using
an atomic broadcast consensus algorithm. The cluster is unavailable only while it is re-electing a
new leader.
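The read/write flow and leader re-election described in the last two slides can be caricatured as follows (a cartoon only; this is not ZooKeeper's ZAB protocol, and picking the lowest-named member as leader just stands in for a real election):

```python
class ToyEnsemble:
    """Reads are served from a local replica; writes are ordered
    centrally and replicated to every member."""

    def __init__(self, members):
        self.members = set(members)
        self.leader = min(self.members)  # stand-in for a real election
        self.replicas = {m: {} for m in self.members}

    def write(self, key, value):
        # the leader orders the write, then the change is
        # replicated to all members of the ensemble
        for m in self.members:
            self.replicas[m][key] = value

    def read(self, member, key):
        # served from the contacted node's local replica
        return self.replicas[member].get(key)

    def fail(self, member):
        self.members.discard(member)
        del self.replicas[member]
        if member == self.leader:  # re-elect on leader failure
            self.leader = min(self.members)


z = ToyEnsemble(["a", "b", "c"])
z.write("/config", "v1")
assert z.read("c", "/config") == "v1"  # change visible on every replica
z.fail("a")                            # the leader dies...
assert z.leader == "b"                 # ...and a new one is elected
```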
Notes
Slide 7: Each individual partition must fit on the servers that host it. However, a topic may have many partitions, so it can handle an arbitrary amount of data.
Slide 9: For example, if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space.
Slide 19: Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.