Apache Kafka is a distributed streaming platform for handling real-time data feeds. It provides a unified, scalable, fault-tolerant infrastructure for ingesting, processing, and delivering these feeds with high throughput and, since version 0.11, exactly-once delivery semantics.
Apache Kafka - Scalable Message Processing and more!
1. Apache Kafka - Scalable Message Processing and more!
Guido Schmutz – 20.6.2017
@gschmutz guidoschmutz.wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 20 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, and Software Architect for Java, Oracle, SOA, and Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
6. Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
• "A unified platform for handling all the real-time data feeds a large company might
have."
Must-haves
• High throughput to support high volume event feeds
• Support real-time processing of these feeds to create new, derived feeds.
• Support large data backlogs to handle periodic ingestion from offline systems
• Support low-latency delivery to handle more traditional messaging use cases
• Guarantee fault-tolerance in the presence of machine failures
7. Apache Kafka - Overview
Distributed publish-subscribe messaging system
Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use the JMS API or standards
Kafka maintains feeds of messages in topics
8. Apache Kafka History
• 0.7 (2012): cluster mirroring, data compression
• 0.8 (2013): intra-cluster replication
• 0.9 (2015): Data Integration (Connect API)
• 0.10 (2016): Data Processing (Streams API)
• 0.11 (2017): Exactly Once Semantics
12. Kafka High Level Architecture
The who is who
• Producers write data to brokers.
Consumers read data from brokers.
• All this is distributed.
The data
• Data is stored in topics.
• Topics are split into partitions, which are replicated.
[Diagram: three Producers and three Consumers exchanging data through a Kafka Cluster of Broker 1, Broker 2, and Broker 3, coordinated by a Zookeeper Ensemble]
16. Kafka Topics
Creating a topic
• Command line interface
• Using AdminUtils.createTopic method
• Auto-create via auto.create.topics.enable = true
Modifying a topic
https://kafka.apache.org/documentation.html#basic_ops_modify_topic
Deleting a topic
• Command Line interface
$ kafka-topics.sh --zookeeper zk1:2181 --create \
    --topic my.topic --partitions 3 \
    --replication-factor 2 --config x=y
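For programmatic creation via AdminUtils.createTopic, a minimal sketch (assumes the pre-1.0 kafka.admin and kafka.utils classes on the classpath; topic settings mirror the CLI example above):

ZkUtils zkUtils = ZkUtils.apply("zk1:2181", 30000, 30000, false);
// topic name, partitions, replication factor, per-topic config
AdminUtils.createTopic(zkUtils, "my.topic", 3, 2,
    new Properties(), RackAwareMode.Enforced$.MODULE$);
zkUtils.close();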
17. Kafka Producer
Write Ahead Log / Commit Log
Producers always append to tail (think append to file)
Order is preserved for messages within same partition
[Diagram: a Truck producer appends message 6 to the tail of the Movement topic (messages 1-5) on a Kafka broker]
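To illustrate the ordering guarantee: records sent with the same key are routed to the same partition, so consumers see them in send order (a sketch; the Movement topic and truck key are hypothetical, and producer is a configured KafkaProducer<String, String>):

// all three records share the key "truck-1", so the default partitioner
// places them in the same partition, preserving their order
producer.send(new ProducerRecord<>("Movement", "truck-1", "position 1"));
producer.send(new ProducerRecord<>("Movement", "truck-1", "position 2"));
producer.send(new ProducerRecord<>("Movement", "truck-1", "position 3"));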
18. Kafka Producer – High Level Overview
[Diagram: producer send path from Producer Client to Kafka Broker]
• A ProducerRecord carries the topic, an optional partition, an optional key, and the value.
• The producer client serializes key and value, then the Partitioner picks the target partition of the topic (here: Movement, partitions 0 and 1).
• Records are collected into per-partition batches, optionally compressed, and sent to the broker.
• If a send fails, the client retries if it can; if it can't retry, it throws an exception. On success it returns the record metadata.
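To make the record fields concrete, the ProducerRecord constructors let you fix the partition explicitly, let the partitioner derive it from the key, or omit both (a sketch; topic and values are hypothetical):

// topic, key, value: partition derived from the key by the partitioner
ProducerRecord<String, String> byKey =
    new ProducerRecord<>("Movement", "truck-1", "47.3,8.5");
// topic, partition, key, value: partition fixed explicitly
ProducerRecord<String, String> byPartition =
    new ProducerRecord<>("Movement", 0, "truck-1", "47.3,8.5");
// topic, value: no key, partitions are assigned round-robin
ProducerRecord<String, String> noKey =
    new ProducerRecord<>("Movement", "47.3,8.5");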
19. Kafka Producer - Durability Guarantees
Producer can configure acknowledgements
• acks=0: the producer doesn't wait for the leader → high throughput, low latency, low durability (no guarantee)
• acks=1 (default): the producer waits for the leader, which sends the ack as soon as the message is written to its log, without waiting for followers → medium throughput, medium latency, medium durability (leader only)
• acks=all (-1): the producer waits for leader and followers; the leader sends the ack once all In-Sync Replicas have acknowledged → low throughput, high latency, high durability (ISR)
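The acknowledgement level is just another producer property (a minimal sketch; kafkaProps is the Properties object shown on the next slide):

// wait for all in-sync replicas: highest durability, highest latency
kafkaProps.put("acks", "all");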
20. Kafka Producer - Java API
Constructing a Kafka Producer
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers","broker1:9092,broker2:9092");
kafkaProps.put("key.serializer", "...StringSerializer");
kafkaProps.put("value.serializer", "...StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);
21. Kafka Producer - Java API
Sending Message Fire-and-Forget (no control whether the message has been sent successfully)

ProducerRecord<String, String> record =
    new ProducerRecord<>("topicName", "Key", "Value");
try {
    producer.send(record);
} catch (Exception e) {
    // only errors raised before the send (e.g. serialization) end up here
    e.printStackTrace();
}

Sending Message Synchronously (wait until the reply from Kafka arrives back)

ProducerRecord<String, String> record =
    new ProducerRecord<>("topicName", "Key", "Value");
try {
    producer.send(record).get(); // get() blocks until the broker replies
} catch (Exception e) {
    e.printStackTrace();
}
22. Kafka Producer - Java API
Sending Message Asynchronously
private class DemoProducerCallback implements Callback {
@Override
public void onCompletion(RecordMetadata recordMetadata,
Exception e) {
if (e != null) {
e.printStackTrace();
}
}
}
ProducerRecord<String, String> record =
    new ProducerRecord<>("topicName", "key", "value");
producer.send(record, new DemoProducerCallback());
23. Kafka Consumer - Consumer Groups
• common to have multiple applications that read data from the same topic
• each application should get all of the messages
• a unique consumer group is assigned to each application
• the number of consumers (threads) per group can be different
• Kafka scales to a large number of consumers without impacting performance
[Diagram: Movement topic with partitions 0-3; Consumer Group 1 has four consumers, one per partition; Consumer Group 2 has two consumers, each reading two partitions]
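In code, group membership is controlled by the group.id consumer property alone: instances sharing an id split the partitions among themselves, while a different id receives every message again (a sketch; the group names are hypothetical):

Properties propsA = new Properties();
// both instances of application A share one group: partitions are split
propsA.put("group.id", "application-a");
Properties propsB = new Properties();
// application B uses its own group and receives all messages independently
propsB.put("group.id", "application-b");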
24. Kafka Consumer - Consumer Groups
[Diagrams: the Movement topic with partitions 0-3, read by Consumer Group 1 in four configurations]
• 1 consumer: it gets messages from all four partitions
• 2 consumers: each gets messages from two partitions
• 4 consumers: each gets messages from one partition
• 5 consumers: one consumer gets no messages (more consumers than partitions)
25. Kafka Consumer - Partition offsets
Offset – A sequential id number assigned to messages in the partitions. Uniquely identifies a message within a partition.
• Consumers track their pointers via (offset, partition, topic) tuples
• Kafka 0.10: seek to offset by given timestamp using method KafkaConsumer#offsetsForTimes
[Diagram: a partition with offsets 1-6; a consumer of group A reads at the "earliest" offset, another consumer of group A at a specific offset, and a consumer of group B at the "latest" offset, where new data from the producer arrives]
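A minimal sketch of the timestamp lookup mentioned above (assumes Kafka 0.10.1+, an already-assigned consumer, and imports from org.apache.kafka.clients.consumer and org.apache.kafka.common; the one-hour window is arbitrary):

// find, per assigned partition, the first offset with a timestamp at or
// after "one hour ago", then seek there
long oneHourAgo = System.currentTimeMillis() - 3600 * 1000L;
Map<TopicPartition, Long> query = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    query.put(tp, oneHourAgo);
}
Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : result.entrySet()) {
    if (entry.getValue() != null) { // null if no message at/after the timestamp
        consumer.seek(entry.getKey(), entry.getValue().offset());
    }
}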
26. Kafka Consumer - Java API
Constructing a Kafka Consumer
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers","broker1:9092,broker2:9092");
kafkaProps.put("group.id","MovementsConsumerGroup");
kafkaProps.put("key.serializer", "...StringSerializer");
kafkaProps.put("value.serializer", "...StringSerializer");
consumer = new KafkaConsumer<String, String>(kafkaProps);
27. Kafka Consumer - Java API
Kafka Consumer Poll Loop (with synchronous offset commit)
consumer.subscribe(Collections.singletonList("topic"));
try {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100); // timeout in ms
for (ConsumerRecord<String, String> record : records) {
// process message; available information:
// record.topic(), record.partition(), record.offset(),
// record.key(), record.value()
}
consumer.commitSync();
}
} finally {
consumer.close();
}
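Where commitSync blocks until the broker confirms, commitAsync trades that certainty for throughput; since it does not retry on its own, a final commitSync before closing is a common companion (a minimal sketch):

// non-blocking commit; failures are reported to the callback, not thrown
consumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        System.err.println("Commit failed for offsets " + offsets);
    }
});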
28. Data Retention – 4 options
1. Never: keep messages forever (e.g. log.retention.ms = -1)
2. Time based (TTL): log.retention.{ms | minutes | hours}
3. Size based: log.retention.bytes
4. Log compaction based (entries with same key are removed):
kafka-topics.sh --zookeeper zk:2181
--create --topic customers
--replication-factor 1
--partitions 1
--config cleanup.policy=compact
29. Data Retention - Log Compaction
Ensures that Kafka always retains at least the last known value for each message key within a single topic partition. Compaction is done in the background by periodically recopying log segments.
Before compaction:
Offset  0   1   2   3   4   5   6   7   8   9   10
Key     K1  K2  K1  K1  K3  K2  K4  K5  K5  K2  K6
Value   V1  V2  V3  V4  V5  V6  V7  V8  V9  V10 V11

After compaction:
Offset  3   4   6   8   9   10
Key     K1  K3  K4  K5  K2  K6
Value   V4  V5  V7  V9  V10 V11
30. Apache Kafka – Some numbers
Kafka at LinkedIn => over 1,800 broker machines / 79,000+ topics
• 1.3 trillion messages per day
• 330 terabytes in/day, 1.2 petabytes out/day
• peak load for a single cluster: 2 million messages/sec, 4.7 Gigabits/sec inbound, 15 Gigabits/sec outbound

Kafka performance on our own infrastructure => 6 brokers (VM) / 1 cluster
• 445,622 messages/second
• 31 MB/second
• 3.0405 ms average latency between producer and consumer
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://engineering.linkedin.com/kafka/running-kafka-scale
31. Inspecting the current state of a topic
Use the --describe option
• Leader: brokerID of the currently elected leader broker
• Replica IDs = broker IDs
• ISR = "in-sync replica", replicas that are in sync with the leader. In this example:
• Broker 0 is leader for partition 1.
• Broker 1 is leader for partitions 0 and 2.
• All replicas are in-sync with their respective leader partitions.
$ kafka-topics.sh --zookeeper zk1:2181 --describe --topic my.topic
Topic:my.topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: my.topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: my.topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: my.topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
34. Kafka Connect - Confluent Connectors
developed, tested, documented, and fully supported by Confluent
Source: http://www.confluent.io/product/connectors
35. Kafka Connect - Certified Connectors
Source: http://www.confluent.io/product/connectors
• developed by vendors utilizing the Kafka Connect framework
• meet criteria for code development best practices, schema registry integration, security, and documentation
36. Kafka Connect – Community Connectors
Source: http://www.confluent.io/product/connectors
• other notable connectors that have been developed utilizing the Kafka Connect framework
38. Kafka Streams - Overview
• Designed as a simple and lightweight library in Apache Kafka
• No external dependencies on systems other than Apache Kafka
• Part of open source Apache Kafka, introduced in 0.10+
• Leverages Kafka as its internal messaging layer
• Agnostic to resource management and configuration tools
• Supports fault-tolerant local state
• Event-at-a-time processing (not microbatch) with millisecond latency
• Windowing with out-of-order data using a Google Dataflow-like model
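A minimal sketch of the 0.10-era Streams API (the topic names and filter logic are hypothetical; serdes are passed explicitly to avoid extra configuration):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "movement-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

// read the Movement topic, drop empty values, write a derived topic
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> movements =
    builder.stream(Serdes.String(), Serdes.String(), "Movement");
movements.filter((key, value) -> value != null && !value.isEmpty())
         .to(Serdes.String(), Serdes.String(), "MovementCleaned");

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();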
48. Summary
• Kafka can scale to millions of messages per second, and more
• Provides the central backbone for your events/messages
• Easy to start with (Proof of Concept)
• Vibrant community and ecosystem
• Fast-paced technology
• Confluent provides Confluent Platform with Support
• Confluent Cloud provides a managed Kafka service on AWS (later Google & Azure)
• Oracle Event Hub Service offers a Kafka Managed Service