Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
2. What is Kafka?
Message flow
Distributed streams
Processing System
Messages sent by
Distributed Producers
to
Distributed Consumers
Consumers
via
Distributed Kafka Cluster
Cluster
3. Kafka Benefits
Fast
Scalable
Reliable
Durable
Open Source
Managed Service
Kafka benefits
4. Kafka Benefits
Fast – high throughput and low latency
Scalable – horizontally scalable with nodes and partitions
Reliable – distributed and fault tolerant
Durable - zero data loss, messages persisted to disk with immutable log
Open Source – An Apache project
Available as a Managed Service - on multiple cloud platforms
Kafka benefits
14. “Poste Restante”?
• Not a post office in a
restaurant
• General delivery (in the
US)
• The mail is delivered to
a post office, they hold
it for you until you call
15. Consumers “poll” for messages by visiting the
Poste Restante counter at the post office
17. Benefits include
• Disconnected delivery –
consumer doesn’t need to
be available to receive
messages
• Less effort for the
messaging service – only
has to deliver to a few
locations not every
consumer
• Can scale better and
handle more complex
delivery semantics!
29. Santa
North PoleTopic
Timestamp, offset, partition
Key -> Partition (optional)
Kafka Producers and Consumers
need a serializer and de-serializer
to write & read key and value
30. • Kafka doesn’t look
at the value
• Consumer can
read value
• And try to make
sense of the
message
• What will Santa
be delivering?!
52. Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Consumer Group
Consumer Group
Consumer
Consumer
Consumer
Consumer
Consumers subscribed to topic are allocated partitions
They will only get messages from their allocated partitions.
53. Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Consumer Group
Consumer
Consumer
Consumer
Consumers share work
within groups
Consumers in the same group share the work around
Each consumer gets only a subset of messages
54. Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Consumer Group
Consumer Group
Consumer
Consumer
Consumer
Consumer
Messages are duplicated across
Consumer groups
Multiple groups enable message broadcasting
Messages are duplicated across groups, each consumer
group receives a copy of each message.
55. Key
Partition based delivery
Which messages are delivered to
which consumers?
If a message has a key, then Kafka
uses Partition based delivery.
Messages with the same key are
always sent to the same partition
and therefore the same
consumer.
And the order is guaranteed.
56. No Key
Round robin delivery
If the key is null, then
Kafka uses round robin
delivery
Each message is delivered
to the next partition
59. Consumer Group = Nerds
Multiple consumers
Consumer Group = Hairy
Single consumer
60. Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Group “Nerds”
Group “Hairy”
Consumer 1 (Bill)
Consumer 2 (Paul)
Consumer n
Consumer 1 (Chewy)
Consumers
Subscribed to “Parties”
No Key
Round Robin
M1
M2
etc
M1
M1
M1
M2
M2
M2
Case 1: No Key
Message (M1, M2, etc) sent to the next partition
All consumers allocated to that partition will receive a message when they poll next.
62. 1. Both Groups subscribe to Topic “parties” (11 partitions, so 1 consumer per partition).
Subscribe to “Parties” Subscribe to “Parties”
63. 1. Both Groups subscribe to Topic “parties” (11 partitions, so 1 consumer per partition).
2. Producer sends record “Cool pool party – Invitation”
<key=null, value=“Cool pool party - Invitation”> to “parties” topic (no key)
64. 1. Both Groups subscribe to Topic “parties” (11 partitions, so 1 consumer per partition).
2. Producer sends record “Cool pool party - Invitation”> to “parties” topic
3. Bill and Chewbacca receive a copy of the invitation and plan to attend
65. 4. Producer sends another record “Cool pool party – Cancelled”
<key=null, value=“Cool pool party - Cancelled”> to “parties” topic
66. 4. Producer sends another record <key=null, value=“Cool pool party - Cancelled”> to “parties” topic
5. Paul and Chewbacca receive the cancellation.
Paul gets the message this time as it’s round robin, ignores it as he didn’t get the invitation. Bill wastes his
time trying to go to cancelled party. The rest of the gang aren’t surprised at not receiving any party invites and
stay at home to do some hacking. Chewy is only consumer in his group so gets all messages, plans something
fun instead…
67.
68. Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Group “Nerds”
Group “Hairy”
Consumer 1 (Bill)
Consumer 2 (Paul)
Consumer n
Consumer 1 (Chewy)
Consumers
Subscribed to “Parties”
Key
Hashed to partition
M1, M2
etc
M1, M2
M1, M2
M1, M2
M3
M3
M3
M3
Case 2: If there is a Key
A key is hashed to a partition, and a Message with that key is always sent to that partition.
Assume there are 3 messages, and messages 1 and 2 are hashed to same partition.
69. Here’s what happens with a key: key is “title” of the message (e.g. “Cool pool party”)
Same set up as before:
1. Both Groups subscribe to Topic “parties” (11 partitions).
70. 1. Both Groups subscribe to Topic “parties” (11 partitions).
2. Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic
71. 1. Both Groups subscribe to Topic “parties” (11 partitions).
2. Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic
3. As before Bill and Chewbacca receive a copy of the invitation and plan to attend
72. 4. Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to “parties” topic
73. 4. Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to “parties” topic
5. Bill and Chewbacca receive the cancellation (same consumers this time, as identical key)
74. 6. Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to ”parties” topic
75. 6. Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to ”parties” topic
7. Paul and Chewy receive the invitation
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is
allocated to
Chewy is the only consumer in his group so he gets every record no matter what partition it’s sent to
79. Real-time data pipeline
Read-time data pipeline features:
• Ingestion of multiple heterogeneous sources
• Sending data to multiple heterogeneous sinks
• Acts as a buffer to smooth out load spikes
• Enables use cases which reprocess data (e.g. disaster recovery)
80. Anomaly Detection Pipeline
Real-time Event processing pipeline:
• Simple event driven applications (If X then Y…)
• May write and read from other data sources (e.g. Cassandra)
• New Events sent back to Kafka or to other systems
• E.g. Anomaly Detection, check out my current blog series if you are interested in this example.
81. Kafka Streams Processing (Kongo IoT Blog series)
Streams processing features:
• Complex streams processing (multiple events and streams)
• Time, windows, and transformations
• Uses Kafka Streams API, includes state store
• Visualization of the streams topology
• Continuously computes the loads for trucks and checks if they are overloaded.
82. Linkedin - Before Kafka (BK)
A real example from Linkedin, who developed Kafka.
Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic
application infrastructure to one based on microservices, which made the integration even more complex!
83. After Kafka (AK)
Rather than maintaining and scaling each pipeline individually, they invested in the
development of a single, distributed pub-sub platform - Kafka was born.
The main benefit was better Service decoupling and independent scaling.
84. The End (of the introduction) -
Find out more
Apache Kafka: https://kafka.apache.org/
Instaclustr blogs
• Mix of Cassandra, Spark, Zeppelin and Kafka
https://www.instaclustr.com/paul-brebner/
• Kafka introduction
https://insidebigdata.com/2018/04/12/developing-deeper-understanding-apache-kafka-architecture/
https://insidebigdata.com/2018/04/19/developing-deeper-understanding-apache-kafka-architecture-part-2-
• Kongo – Kafka IoT logistics application blog series
https://www.instaclustr.com/instaclustr-kongo-iot-logistics-streaming-demo-application/
• Anomaly detection with Kafka and Cassandra (and Kubernetes), current blog series
https://www.instaclustr.com/anomalia-machina-1-massively-scalable-anomaly-detection-with-apache-kafka-
Instaclustr’s Managed Kafka (Free trial)
https://www.instaclustr.com/solutions/managed-apache-kafka/
Hinweis der Redaktion
What is Kafka? Kafka is a distributed streams processing system, it allows distributed producers to send messages to distributed consumers via a Kafka cluster.
Kafka benefits
Fast – high throughput and low latency
Scalable – horizontally scalable, just add nodes and partitions
Reliable – distributed and fault tolerant
Zero data loss – messages are persisted to disk with immutable log
Open Source – An Apache project
Available as a Managed Service - on multiple cloud platforms
Kafka benefits
Fast – high throughput and low latency
Scalable – horizontally scalable, just add nodes and partitions
Reliable – distributed and fault tolerant
Zero data loss – messages are persisted to disk with immutable log
Open Source – An Apache project
Available as a Managed Service - on multiple cloud platforms
Back to the intro Kafka overview diagram, it’s a bit monochrome and boring. This talk will be more colourful and it’s going to be an extended story…
Back to the intro Kafka overview diagram, it’s a bit monochrome and boring. This talk will be more colourful and it’s going to be an extended story…
Let’s build a modern day fully electronic postal service – the Kafka Postal Service
What does a postal service do? It sends messages from A to B (animated with sounds, click!)
A is a producer, sends a message…
To B, the consumer.
Actually, no. Due to the decline in “snail mail” volumes, direct deliveries have been cancelled.
Actually, no. Due to the decline in “snail mail” volumes, direct deliveries have been cancelled.
Instead we have “Poste Restante” - Not a post office in a restaurant
It’s called general delivery (in the US).
The mail is delivered to a post office, and they hold it for you until you call for it.
Consumers poll for messages by visiting the counter at the post office.
Consumers poll for messages by visiting the counter at the post office.
Kafka topics act like a Post Office. What are the benefits?
Disconnected delivery – consumer doesn’t need to be available to receive messages
Less effort for the messaging service – only has to delivery to a few locations not larger number of consumer addresses
Can scale better and handle more complex delivery semantics!
Kafka topics act like a Post Office. What are the benefits?
Disconnected delivery – consumer doesn’t need to be available to receive messages
Less effort for the messaging service – only has to delivery to a few locations not larger number of consumer addresses
Can scale better and handle more complex delivery semantics!
First lets see how it scales. What if there are many consumers for a topic?
A single counter introduces delays and limits concurrency
More counters increases concurrency and can reduce delays
Kafka Topics have 1 or more Partitions, partitions function like multiple counters and enable high concurrency.
Before looking at delivery semantics what does a message actually look like? In Kafka a message is called a Record and is like a letter.
The topic is the destination.
The “Postmark” includes a timestamp, offset in the topic, and the partition it was sent to. Time semantics are flexible, either the time of event creation, ingestion, or processing.
The “Postmark” includes a timestamp, offset in the topic, and the partition it was sent to. Time semantics are flexible, either the time of event creation, ingestion, or processing.
There’s also a thing called a Key, which is is optional.
It refines the destination so it’s a bit like the rest of the address. We want this letter sent to Santa not just any Elf.
There’s also a thing called a Key, which is is optional.
It refines the destination so it’s a bit like the rest of the address. We want this letter sent to Santa not just any Elf.
The value is the contents (just a byte array). Kafka Producers and consumers need to have a shared serializer and de-serializer for both the key and value.
The value is the contents (just a byte array). Kafka Producers and consumers need to have a shared serializer and de-serializer for both the key and value.
Kafka doesn’t look inside the value, but the Producer and Consumer can, and the Consumer can try and make sense of the message…
I wonder what Santa will be delivering?
Next lets look at delivery semantics. For example, do we care if the message actually arrives or not?
Yes we do! Guaranteed message delivery is desirable.
Homing pigeons got lost or eaten, so need to send the message with multiple pigeons
Yes we do! Guaranteed message delivery is desirable.
Homing pigeons got lost or eaten, so need to send the message with multiple pigeons
Yes we do! Guaranteed message delivery is desirable.
Homing pigeons got lost or eaten, so need to send the message with multiple pigeons
Yes we do! Guaranteed message delivery is desirable.
Homing pigeons got lost or eaten, so need to send the message with multiple pigeons
How does Kafka guarantee delivery?
A Message (M1) is written to a broker (2)
How does Kafka guarantee delivery?
A Message (M1) is written to a broker (2)
and the message is always persisted to disk.
This makes it resilient to loss of power.
The message is also replicated on multiple “brokers” , 3 is typical
And makes it resilient to loss of most servers
Finally the producer gets acknowledgement once the message is persisted and replicated (configurable for number, and sync or async)
The message is now available from more than one broker in case some fail.
This also increases the read concurrency as partitions are spread over multiple brokers.
Finally the producer gets acknowledgement once the message is persisted and replicated (configurable for number, and sync or async)
The message is now available from more than one broker in case some fail.
This also increases the read concurrency as partitions are spread over multiple brokers.
Now let’s look at the 2nd aspect of delivery semantics.
Who gets the messages and how many times are messages delivered?
Kafka is “pub-sub”, It’s loosely coupled, producers and consumers don’t know about each other
Now let’s look at the 2nd aspect of delivery semantics.
Who gets the messages and how many times are messages delivered?
Kafka is “pub-sub”, It’s loosely coupled, producers and consumers don’t know about each other
Filtering, or which consumers get which messages, is topic based
Publishers send messages to topics
Consumers subscribe to topics of interest, e.g. parties. When they poll they only receive messages sent to those topics.
Filtering, or which consumers get which messages, is topic based
Publishers send messages to topics
Consumers subscribe to topics of interest, e.g. parties. When they poll they only receive messages sent to those topics.
Filtering, or which consumers get which messages, is topic based
Publishers send messages to topics
Consumers subscribe to topics of interest, e.g. parties. When they poll they only receive messages sent to those topics.
Filtering, or which consumers get which messages, is topic based
Publishers send messages to topics
Consumers subscribe to topics of interest, e.g. parties. When they poll they only receive messages sent to those topics.
Just a few more details and we can see how this works. Partitions and consumer groups enable sharing of work across multiple consumers, the more partitions a topic has the more consumers it supports
Kafka supports delivery of the same message to multiple consumers. Kafka doesn’t throw messages away immediately they are delivered, so the same message can easily be delivered to multiple consumer groups.
Consumers subscribed to ”parties” topic are allocated partitions. When they poll they will only get messages from their allocated partitions.
This enables consumers in the same group to share the work around. Each consumer gets only a subset of the available messages.
Multiple groups enable message broadcasting. Messages are duplicated across groups, as each consumer group receives a copy of each message.
Which messages are delivered to which consumers? The final aspect of delivery semantics is to do with message keys.
If a message has a key, then Kafka uses Partition based delivery.
Messages with the same key are always sent to the same partition and therefore the same consumer.
And the order is guaranteed.
If the key is null, then Kafka uses round robin delivery. Each message is delivered to the next partition.
Let’s look at an example with 2 consumer groups. Nerds, which has multiple consumers, and Hairy which has a single consumer.
Let’s look at an example with 2 consumer groups. Nerds, which has multiple consumers, and Hairy which has a single consumer.
Let’s look at an example with 2 consumer groups. Nerds, which has multiple consumers, and Hairy which has a single consumer.
Looking at the case where there’s No Key 1st, each message (1, 2, etc) is sent to the next partition, and all consumers allocated to that partition will receive the message when they poll next.
Here’s what actually happens. We’re not showing the producer or topic for simplicity. You’ll have to imagine them.
Both Groups subscribe to Topic “parties” (11 partitions, so 1 consumer per partition).
2 Producer sends record with the value “Cool pool party - Invitation” to “parties” topic (there’s no key)
3 Bill and Chewbacca receive a copy of the invitation and plan to attend
4. Producer sends another record with the value “Cool pool party – Cancelled” to “parties” topic
In the Nerds group, Paul gets the message this time as it’s round robin, and Chewy gets it as he’s the only consumer in his group.
Paul ignores it as he didn’t get the original invite
Bill wastes his time trying to go
The rest of the gang aren’t surprised at not receiving any invites and stay home to do some hacking
Chewy plans something else fun instead...
A visit to the hairdressers!
How does it work if there is a Key? The key is hashed to a partition, and the Message is sent to that partition.
Assume there are 3 messages, and messages 1 and 2 are hashed to same partition.
Here’s what happens with a key, assuming that the key is the “title” of the message (“Cool pool party”)
As before Both Groups subscribe to Topic “parties” (11 partitions).
Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic
Here’s what happens with a key, assuming that the key is the “title” of the message (“Cool pool party”)
As before Both Groups subscribe to Topic “parties” (11 partitions).
Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic
As before, Bill and Chewbacca receive a copy of the invitation and plan to attend
4. Producer sends another record with the same key but a cancellation value to “parties” topic
This time, Bill and Chewbacca receive the cancellation (the same consumers as the key is identical)
Producer sends out another invitation to a Halloween party. The key is different this time.
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is allocated
Chewy is the only consumer in his group so he gets every record no matter what partition it’s sent to
This time Chewy gets dressed up and goes to the party.
Who did you imagine was producing the invitations? Maybe this fellow.
Here’s a UML-like diagram with the main Kafka components. We’ve introduced Producers, Topics, Partitions, Consumer Groups and Consumes today.
There’s a lot more to explore, including how Kafka provides replication, and the Connect and Streaming APIs.
To finish up here are three important use cases for Kafka
The Real-time Data pipeline features
Ingestion of multiple heterogeneous sources
Sending data to multiple heterogeneous sinks
Acts as a buffer to smooth out load spikes
Enables use cases which reprocess data (e.g. disaster recovery)
Real-time Event processing features:
Simple event driven applications (If X then Y…)
May write and read from other data sources (e.g. Cassandra)
New Events sent back to Kafka or to other systems
E.g. Anomaly Detection, check out my current blog series if you are interested in this example.
Streams processing features:
Complex streams processing (multiple events and streams)
Time, windows, and transformations
Streams API, includes state store
This a visualization of the streams topology from the streams processing pipeline from my previous blog series, Kongo, a Kafka IoT logistics application.
It continuously computes the loads for trucks and checks if they are overloaded.
Here’s a real example from Linkedin, who developed Kafka. Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic application infrastructure to one based on microservices, which made the integration even more complex.
After Kafka
Rather than maintaining and scaling each pipeline individually, they invested in the development of a single, distributed pub-sub platform. Thus, Kafka was born. This main benefit was better Service decoupling and independent scaling.