SlideShare a Scribd company logo
1 of 49
Download to read offline
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Apache Kafka
Scalable Message Processing and more!
Guido Schmutz – 20.6.2017
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 20 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Apache Kafka - Scalable Message Processing and more!
COPENHAGEN
MUNICH
LAUSANNE
BERN
ZURICH
BRUGG
GENEVA
HAMBURG
DÜSSELDORF
FRANKFURT
STUTTGART
FREIBURG
BASEL
VIENNA
With over 600 specialists and IT experts in your region.
14 Trivadis branches and more than
600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget:
CHF 5.0 million
Financially self-supporting and
sustainably profitable
Experience from more than 1,900
projects per year at over 800
customers
Agenda
1. Introduction & Motivation
2. Kafka Core
3. Kafka Connect & Kafka Streams
4. Kafka and "Big Data" / "Fast Data" Ecosystem
5. Kafka in Enterprise Architecture
6. Summary
Introduction & Motivation
Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
• "A unified platform for handling all the real-time data feeds a large company might
have."
Must haves
• High throughput to support high volume event feeds
• Support real-time processing of these feeds to create new, derived feeds.
• Support large data backlogs to handle periodic ingestion from offline systems
• Support low-latency delivery to handle more traditional messaging use cases
• Guarantee fault-tolerance in the presence of machine failures
Apache Kafka - Overview
Distributed publish-subscribe messaging system
Designed for processing of real time activity stream data (logs, metrics
collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use JMS API and standards
Kafka maintains feeds of messages in topics
Apache Kafka History
2012 2013 2014 2015 2016 2017
Cluster	mirroring,
data	compression
Intra-cluster
replication
0.7
0.8
0.9
Data	Processing
(Streams	API)
0.10
Data	Integration
(Connect	API)
Exactly	Once
Semantics
0.11
Confluent Data Platform 3.2
Source:	Confluent
Apache Kafka - Unix Analogy
$ cat < in.txt | grep "kafka" | tr a-z A-Z > out.txt
Kafka	Connect	API Kafka	Connect	APIKafka	Streams	API
Kafka	Core	(Cluster)
Source:	Confluent
Kafka Core
Kafka High Level Architecture
The who is who
• Producers write data to brokers.
• Consumers read data from
brokers.
• All this is distributed.
The data
• Data is stored in topics.
• Topics are split into partitions,
which are replicated.
Kafka Cluster
Consumer Consumer Consumer
Producer Producer Producer
Broker 1 Broker 2 Broker 3
Zookeeper
Ensemble
Apache Kafka - Architecture
Kafka Broker
Movement
Processor
Movement	Topic
Engine-Metrics	Topic
1 2 3 4 5 6
Engine
Processor1 2 3 4 5 6
Truck
Apache Kafka - Architecture
Kafka Broker
Movement
Processor
Movement	Topic
Engine-Metrics	Topic
1 2 3 4 5 6
Engine
Processor
Partition	0
1 2 3 4 5 6
Partition	0
1 2 3 4 5 6
Partition	1 Movement
Processor
Truck
Apache
Kafka
Kafka Broker 1
Movement
Processor
Truck
Movement	Topic
P	0
Movement
Processor
1 2 3 4 5
P	2 1 2 3 4 5
Kafka Broker 2
Movement	Topic
P	2 1 2 3 4 5
P	1 1 2 3 4 5
Kafka Broker 3
Movement	Topic
P	0 1 2 3 4 5
P	1 1 2 3 4 5
Movement
Processor
Kafka Topics
Creating a topic
• Command line interface
• Using AdminUtils.createTopic method
• Auto-create via auto.create.topics.enable = true
Modifying a topic
https://kafka.apache.org/documentation.html#basic_ops_modify_topic
Deleting a topic
• Command Line interface
$ kafka-topics.sh –zookeeper zk1:2181 --create 
--topic my.topic –-partitions 3 
–-replication-factor 2 --config x=y
Kafka Producer
Write Ahead Log / Commit Log
Producers always append to tail (think append to file)
Order is preserved for messages within same partition
Kafka Broker
Movement	Topic
1 2 3 4 5
Truck
6 6
Kafka Producer – High Level Overview
Producer Client
Kafka Broker
Movement	Topic
Partition	0
Partitioner
Movement Topic
Serializer
Producer	Record
message	1
message	2
message	3
message	4
Batch
Movement	Topic
Partition	1
message	1
message	2
message	3
Batch
Partition	0
Partition	1
Retry
?
Fail
?
Topic
Message
[	Partition	]
[	Key	]
Value
yes
yes
if	can’t	retry:
throw	exception
successful:
return	metadata
Compression	(optional)
Kafka Producer - Durability Guarantees
Producer can configure acknowledgements
Value Description Throughput Latency Durability
0 • Producer	doesn’t	wait	for	leader high low low (no	
guarantee)
1	
(default)
• Producer	waits	for	leader
• Leader	sends ack when	message	
written	to	log
• No	wait	for	followers
medium medium medium	
(leader)
all	(-1) • Producer	waits	for	leader	&	followers
• Leader	sends	ack when all	In-Sync	
Replicas	have	acknowledged
low high high	(ISR)
Kafka Producer - Java API
Constructing a Kafka Producer
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers","broker1:9092,broker2:9092");
kafkaProps.put("key.serializer", "...StringSerializer");
kafkaProps.put("value.serializer", "...StringSerializer");
producer = new KafkaProducer<String, String>(kafkaProps);
Kafka Producer - Java API
Sending Message Synchronously (no control if message has been sent successful)
Sending Message Synchronously (wait until reply from Kafka arrives back)
ProducerRecord<String, String> record =
new ProducerRecord<>("topicName", "Key", "Value");
try {
producer.send(record);
} catch (Exception e) {}
ProducerRecord<String, String> record =
new ProducerRecord<>("topicName", "Key", "Value");
try {
producer.send(record).get();
} catch (Exception e) {}
Kafka Producer - Java API
Sending Message Asynchronously
private class DemoProducerCallback implements Callback {
@Override
public void onCompletion(RecordMetadata recordMetadata,
Exception e) {
if (e != null) {
e.printStackTrace();
}
}
}
ProducerRecord<String, String> record = new
ProducerRecord<>("topicName", "key", "value");
producer.send(record, new DemoProducerCallback());
Kafka Consumer - Consumer Groups
• common to have multiple applications
that read data from same topic
• each application should get all of the
messages
• unique consumer group assigned to
each application
• number of consumers (threads) per
group can be different
• Kafka scales to large number of
consumers without impacting
performance
Kafka
Movement	Topic
Partition	0
Consumer Group 1
Consumer	1
Partition	1
Partition	2
Partition	3
Consumer	2
Consumer	3
Consumer	4
Consumer Group 2
Consumer	1
Consumer	2
Kafka Consumer - Consumer Groups
Kafka
Movement	Topic
Partition	0
Consumer Group 1
Consumer	1
Partition	1
Partition	2
Partition	3
Kafka
Movement	Topic
Partition	0
Consumer Group 1
Partition	1
Partition	2
Partition	3
Kafka
Movement	Topic
Partition	0
Consumer Group 1
Partition	1
Partition	2
Partition	3
Kafka
Movement	Topic
Partition	0
Consumer Group 1
Partition	1
Partition	2
Partition	3
Consumer	1
Consumer	2
Consumer	3
Consumer	4
Consumer	1
Consumer	2
Consumer	3
Consumer	4
Consumer	5
Consumer	1
Consumer	2
2	Consumers	/	each	get	messages	from	2	partitions
1	Consumer	/	get	messages	from	all	partitions
5 Consumers	/	one	gets	no	messages
4 Consumers	/	each	get	messages	from	1	partition
Kafka Consumer - Partition offsets
Offset – A sequential id number assigned to messages in the partitions. Uniquely
identifies a message within a partition.
• Consumers track their pointers via (offset, partition, topic) tuples
• Kafka 0.10: seek to offset by given timestamp using method KafkaConsumer#offsetsForTimes
Consumer	Group	A Consumer	Group	B
1 2 3 4 5 6
Consumer	at
“earliest”	offset
Consumer	at
“latest”	offset
New	data
from	Producer
Consumer	at
specific	offset
Kafka Consumer - Java API
Constructing a Kafka Consumer
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers","broker1:9092,broker2:9092");
kafkaProps.put("group.id","MovementsConsumerGroup");
kafkaProps.put("key.serializer", "...StringSerializer");
kafkaProps.put("value.serializer", "...StringSerializer");
consumer = new KafkaConsumer<String, String>(kafkaProps);
Kafka Consumer - Java API
Kafka Consumer Poll Loop (with synchronous offset commit)
consumer.subscribe(Collections.singletonList("topic"));
try {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
// process message, available information:
// record.topic(), record.partition(), record.offset(),
// record.key(), record.value());
}
consumer.commitSync();
}
} finally {
consumer.close();
}
Data Retention – 3 options
1. Never:
2. Time based (TTL): log.retention.{ms | minutes | hours}
3. Size based: log.retention.bytes
4. Log compaction based (entries with same key are removed):
kafka-topics.sh --zookeeper zk:2181 
--create --topic customers 
--replication-factor 1 
--partitions 1 
--config cleanup.policy=compact
Data Retention - Log Compaction
ensures that Kafka always retain at least the last known value for each message key
within a single topic partition
compaction is done in the background by periodically recopying log segments.
0 1 2 3 4 5 6 7 8 9 10 11
K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 K2
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
Offset
Key
Value
3 4 6 8 9 10
K1 K3 K4 K5 K2 K6
V4 V5 V7 V9 V10 V11
Offset
Key
Value
Compaction
Apache Kafka – Some numbers
Kafka at LinkedIn => over 1800+ broker machines / 79K+ Topics
Kafka Performance at our own infrastructure => 6 brokers (VM) / 1 cluster
• 445’622 messages/second
• 31 MB / second
• 3.0405 ms average latency between producer / consumer
1.3	Trillion	messages	per	
day
330	Terabytes	in/day
1.2	Petabytes	out/day
Peak	load	for	a	single	cluster
2	million	messages/sec
4.7	Gigabits/sec	inbound
15	Gigabits/sec	outbound
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://engineering.linkedin.com/kafka/running-kafka-scale
Inspecting the current state of a topic
Use the --describe option
• Leader: brokerID of the currently elected leader broker
• Replica ID’s = broker ID’s
• ISR = "in-sync replica", replicas that are in sync with the leader. In this example:
• Broker 0 is leader for partition 1.
• Broker 1 is leader for partitions 0 and 2.
• All replicas are in-sync with their respective leader partitions.
$ kafka-topics.sh –zookeeper zk1:2181 –-describe --topic my.topic
Topic:zerg2.hydra PartitionCount:3 ReplicationFactor:2 Configs:
Topic: my.topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: my.topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: my.topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
Kafka Connect & Kafka Streams
Kafka Connect - Overview
Source:	Confluent
Kafka Connect - Confluent Connectors
developed, tested, documented and are fully supported by Confluent
Source:	http://www.confluent.io/product/connectors
Kafka Connect - Certified
Connectors
Source:	http://www.confluent.io/product/connectors
• developed by vendors utilizing the Kafka
Connect framework
• meet criteria for code development best
practices, schema registry integration, security,
and documentation
Kafka Connect – Community
Connectors
Source:	http://www.confluent.io/product/connectors
• Other notable Connectors that have been
developed utilizing the Kafka Connect
framework
Kafka Connect – Twitter example
./connect-standalone.sh ../demo-config/connect-simple-source-standalone.properties
../demo-config/twitter-source.properties
name=twitter-source
connector.class=com.eneco.trading.kafka.connect.twitter.TwitterSourceConnector
tasks.max=1
topic=tweets
twitter.consumerkey=<consumer-key>
twitter.consumersecret=<consumer-secret>
twitter.token=<token>
twitter.secret=<token-secret>
track.terms=bigdata
bootstrap.servers=localhost:9095,localhost:9096,localhost:9097
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
...
Kafka Streams - Overview
• Designed as a simple and lightweight library in Apache
Kafka
• no external dependencies on systems other than Apache
Kafka
• Part of open source Apache Kafka, introduced in 0.10+
• Leverages Kafka as its internal messaging layer
• agnostic to resource management and configuration tools
• Supports fault-tolerant local state
• Event-at-a-time processing (not microbatch) with millisecond
latency
• Windowing with out-of-order data using a Google DataFlow-like
model
Streams API in the context of Kafka
Source:	Confluent
Kafka and "Big Data" / "Fast Data"
Ecosystem
Kafka and the Big Data / Fast Data ecosystem
Kafka integrates with many popular products / frameworks
• Apache Spark Streaming
• Apache Flink
• Apache Storm
• Apache Apex
• Apache NiFi
• StreamSets
• Oracle Stream Analytics
• Oracle Service Bus
• Oracle GoldenGate
• Oracle Event Hub Cloud Service
• Debezium CDC
• …
Additional	Info:	https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Kafka in Enterprise Architecture
Hadoop Clusterd
Hadoop Cluster
Big Data Cluster
Traditional Big Data Architecture
BI	Tools
Enterprise Data
Warehouse
Billing &
Ordering
CRM /
Profile
Marketing
Campaigns
File Import / SQL Import
SQL
Search	/	Explore
Online	&	Mobile	
Apps
Search
NoSQL
Parallel Batch
Processing
Distributed
Filesystem
• Machine	Learning
• Graph	Algorithms
• Natural	Language	Processing
Event
Hub
Event
Hub
Hadoop Clusterd
Hadoop Cluster
Big Data Cluster
Event Hub – handle event stream data
BI	Tools
Enterprise Data
Warehouse
Location
Social
Click
stream
Sensor
Data
Billing &
Ordering
CRM /
Profile
Marketing
Campaigns
Event
Hub
Call
Center
Weather
Data
Mobile
Apps
SQL
Search	/	Explore
Online	&	Mobile	
Apps
Search
Data Flow
NoSQL
Parallel Batch
Processing
Distributed
Filesystem
• Machine	Learning
• Graph	Algorithms
• Natural	Language	Processing
Hadoop Clusterd
Hadoop Cluster
Big Data Cluster
Event Hub – taking Velocity into account
Location
Social
Click
stream
Sensor
Data
Billing &
Ordering
CRM /
Profile
Marketing
Campaigns
Call
Center
Mobile
Apps
Batch Analytics
Streaming Analytics
Results
Parallel Batch
Processing
Distributed
Filesystem
Stream Analytics
NoSQL
Reference /
Models
SQL
Search
Dashboard
BI	Tools
Enterprise Data
Warehouse
Search	/	Explore
Online	&	Mobile	
Apps
File Import / SQL Import
Weather
Data
Event
Hub
Event
Hub
Event
Hub
Container
Hadoop Clusterd
Hadoop Cluster
Big Data Cluster
Event Hub – Asynchronous Microservice Architecture
Location
Social
Click
stream
Sensor
Data
Billing &
Ordering
CRM /
Profile
Marketing
Campaigns
Call
Center
Mobile
Apps
Parallel
Batch
ProcessingDistributed
Filesystem
Microservice
NoSQLRDBMS
SQL
Search
BI	Tools
Enterprise Data
Warehouse
Search	/	Explore
Online	&	Mobile	
Apps
File Import / SQL Import
Weather
Data
{		}
API
Event
Hub
Event
Hub
Event
Hub
Summary
Summary
• Kafka can scale to millions of messages per second, and more
• Provides the central backbone for your events/messages
• Easy to start with (Proof of Concept)
• Vibrant community and ecosystem
• Fast paced technology
• Confluent provides Confluent Platform with Support
• Confluent Cloud provides a managed Kafka service on AWS (later Google & Azure)
• Oracle Event Hub Service offers a Kafka Managed Service
Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com
@gschmutz guidoschmutz.wordpress.com

More Related Content

What's hot

Apache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing PlatformApache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing Platform
Guido Schmutz
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Guido Schmutz
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Guido Schmutz
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
Guido Schmutz
 

What's hot (20)

Microservices with Kafka Ecosystem
Microservices with Kafka EcosystemMicroservices with Kafka Ecosystem
Microservices with Kafka Ecosystem
 
Apache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing PlatformApache Kafka - A modern Stream Processing Platform
Apache Kafka - A modern Stream Processing Platform
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
 
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
 
Ingesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseIngesting streaming data into Graph Database
Ingesting streaming data into Graph Database
 
Internet of Things (IoT) - in the cloud or rather on-premises?
Internet of Things (IoT) - in the cloud or rather on-premises?Internet of Things (IoT) - in the cloud or rather on-premises?
Internet of Things (IoT) - in the cloud or rather on-premises?
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
 
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & KafkaSelf-Service Data Ingestion Using NiFi, StreamSets & Kafka
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Internet of Things (IoT) - in the cloud or rather on-premises?
Internet of Things (IoT) - in the cloud or rather on-premises?Internet of Things (IoT) - in the cloud or rather on-premises?
Internet of Things (IoT) - in the cloud or rather on-premises?
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
Apache Kafka at LinkedIn - How LinkedIn Customizes Kafka to Work at the Trill...
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Building event-driven (Micro)Services with Apache Kafka
Building event-driven (Micro)Services with Apache Kafka Building event-driven (Micro)Services with Apache Kafka
Building event-driven (Micro)Services with Apache Kafka
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 

Similar to Apache Kafka - Scalable Message Processing and more!

Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
Guido Schmutz
 

Similar to Apache Kafka - Scalable Message Processing and more! (20)

Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
 
I can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and SpringI can't believe it's not a queue: Kafka and Spring
I can't believe it's not a queue: Kafka and Spring
 
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
 
Kafka Explainaton
Kafka ExplainatonKafka Explainaton
Kafka Explainaton
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Jug - ecosystem
Jug -  ecosystemJug -  ecosystem
Jug - ecosystem
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
 
JHipster conf 2019 - Kafka Ecosystem
JHipster conf 2019 - Kafka EcosystemJHipster conf 2019 - Kafka Ecosystem
JHipster conf 2019 - Kafka Ecosystem
 
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
 
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache PulsarCODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
 

More from Guido Schmutz

Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaLocation Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache Kafka
Guido Schmutz
 
Location Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaLocation Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using Kafka
Guido Schmutz
 
Streaming Visualisation
Streaming VisualisationStreaming Visualisation
Streaming Visualisation
Guido Schmutz
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Guido Schmutz
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
Guido Schmutz
 

More from Guido Schmutz (20)

30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code
 
Event Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data ArchitectureEvent Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data Architecture
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureEvent Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache Kafka
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaLocation Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache Kafka
 
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache KafkaSolutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
 
What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Location Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaLocation Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using Kafka
 
Streaming Visualisation
Streaming VisualisationStreaming Visualisation
Streaming Visualisation
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI Architecture
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Apache Kafka - Scalable Message Processing and more!

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Apache Kafka Scalable Message Processing and more! Guido Schmutz – 20.6.2017 @gschmutz guidoschmutz.wordpress.com
  • 2. Guido Schmutz Working at Trivadis for more than 20 years Oracle ACE Director for Fusion Middleware and SOA Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: guido.schmutz@trivadis.com Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz Apache Kafka - Scalable Message Processing and more!
  • 3. COPENHAGEN MUNICH LAUSANNE BERN ZURICH BRUGG GENEVA HAMBURG DÜSSELDORF FRANKFURT STUTTGART FREIBURG BASEL VIENNA With over 600 specialists and IT experts in your region. 14 Trivadis branches and more than 600 employees 200 Service Level Agreements Over 4,000 training participants Research and development budget: CHF 5.0 million Financially self-supporting and sustainably profitable Experience from more than 1,900 projects per year at over 800 customers
  • 4. Agenda 1. Introduction & Motivation 2. Kafka Core 3. Kafka Connect & Kafka Streams 4. Kafka and "Big Data" / "Fast Data" Ecosystem 5. Kafka in Enterprise Architecture 6. Summary
  • 6. Apache Kafka - Motivation LinkedIn’s motivation for Kafka was: • "A unified platform for handling all the real-time data feeds a large company might have." Must haves • High throughput to support high volume event feeds • Support real-time processing of these feeds to create new, derived feeds. • Support large data backlogs to handle periodic ingestion from offline systems • Support low-latency delivery to handle more traditional messaging use cases • Guarantee fault-tolerance in the presence of machine failures
  • 7. Apache Kafka - Overview Distributed publish-subscribe messaging system Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …) Initially developed at LinkedIn, now part of Apache Does not use JMS API and standards Kafka maintains feeds of messages in topics
  • 8. Apache Kafka History 2012 2013 2014 2015 2016 2017 Cluster mirroring, data compression Intra-cluster replication 0.7 0.8 0.9 Data Processing (Streams API) 0.10 Data Integration (Connect API) Exactly Once Semantics 0.11
  • 9. Confluent Data Platform 3.2 Source: Confluent
  • 10. Apache Kafka - Unix Analogy $ cat < in.txt | grep "kafka" | tr a-z A-Z > out.txt Kafka Connect API Kafka Connect APIKafka Streams API Kafka Core (Cluster) Source: Confluent
  • 12. Kafka High Level Architecture The who is who • Producers write data to brokers. • Consumers read data from brokers. • All this is distributed. The data • Data is stored in topics. • Topics are split into partitions, which are replicated. Kafka Cluster Consumer Consumer Consumer Producer Producer Producer Broker 1 Broker 2 Broker 3 Zookeeper Ensemble
  • 13. Apache Kafka - Architecture Kafka Broker Movement Processor Movement Topic Engine-Metrics Topic 1 2 3 4 5 6 Engine Processor1 2 3 4 5 6 Truck
  • 14. Apache Kafka - Architecture Kafka Broker Movement Processor Movement Topic Engine-Metrics Topic 1 2 3 4 5 6 Engine Processor Partition 0 1 2 3 4 5 6 Partition 0 1 2 3 4 5 6 Partition 1 Movement Processor Truck
  • 15. Apache Kafka Kafka Broker 1 Movement Processor Truck Movement Topic P 0 Movement Processor 1 2 3 4 5 P 2 1 2 3 4 5 Kafka Broker 2 Movement Topic P 2 1 2 3 4 5 P 1 1 2 3 4 5 Kafka Broker 3 Movement Topic P 0 1 2 3 4 5 P 1 1 2 3 4 5 Movement Processor
  • 16. Kafka Topics Creating a topic • Command line interface • Using AdminUtils.createTopic method • Auto-create via auto.create.topics.enable = true Modifying a topic https://kafka.apache.org/documentation.html#basic_ops_modify_topic Deleting a topic • Command Line interface $ kafka-topics.sh –zookeeper zk1:2181 --create --topic my.topic –-partitions 3 –-replication-factor 2 --config x=y
  • 17. Kafka Producer Write Ahead Log / Commit Log Producers always append to tail (think append to file) Order is preserved for messages within same partition Kafka Broker Movement Topic 1 2 3 4 5 Truck 6 6
  • 18. Kafka Producer – High Level Overview Producer Client Kafka Broker Movement Topic Partition 0 Partitioner Movement Topic Serializer Producer Record message 1 message 2 message 3 message 4 Batch Movement Topic Partition 1 message 1 message 2 message 3 Batch Partition 0 Partition 1 Retry ? Fail ? Topic Message [ Partition ] [ Key ] Value yes yes if can’t retry: throw exception successful: return metadata Compression (optional)
  • 19. Kafka Producer - Durability Guarantees Producer can configure acknowledgements Value Description Throughput Latency Durability 0 • Producer doesn’t wait for leader high low low (no guarantee) 1 (default) • Producer waits for leader • Leader sends ack when message written to log • No wait for followers medium medium medium (leader) all (-1) • Producer waits for leader & followers • Leader sends ack when all In-Sync Replicas have acknowledged low high high (ISR)
  • 20. Kafka Producer - Java API Constructing a Kafka Producer private Properties kafkaProps = new Properties(); kafkaProps.put("bootstrap.servers","broker1:9092,broker2:9092"); kafkaProps.put("key.serializer", "...StringSerializer"); kafkaProps.put("value.serializer", "...StringSerializer"); producer = new KafkaProducer<String, String>(kafkaProps);
  • 21. Kafka Producer - Java API Sending Message Synchronously (no control if message has been sent successful) Sending Message Synchronously (wait until reply from Kafka arrives back) ProducerRecord<String, String> record = new ProducerRecord<>("topicName", "Key", "Value"); try { producer.send(record); } catch (Exception e) {} ProducerRecord<String, String> record = new ProducerRecord<>("topicName", "Key", "Value"); try { producer.send(record).get(); } catch (Exception e) {}
  • 22. Kafka Producer - Java API Sending Message Asynchronously private class DemoProducerCallback implements Callback { @Override public void onCompletion(RecordMetadata recordMetadata, Exception e) { if (e != null) { e.printStackTrace(); } } } ProducerRecord<String, String> record = new ProducerRecord<>("topicName", "key", "value"); producer.send(record, new DemoProducerCallback());
  • 23. Kafka Consumer - Consumer Groups • common to have multiple applications that read data from same topic • each application should get all of the messages • unique consumer group assigned to each application • number of consumers (threads) per group can be different • Kafka scales to large number of consumers without impacting performance Kafka Movement Topic Partition 0 Consumer Group 1 Consumer 1 Partition 1 Partition 2 Partition 3 Consumer 2 Consumer 3 Consumer 4 Consumer Group 2 Consumer 1 Consumer 2
  • 24. Kafka Consumer - Consumer Groups Kafka Movement Topic Partition 0 Consumer Group 1 Consumer 1 Partition 1 Partition 2 Partition 3 Kafka Movement Topic Partition 0 Consumer Group 1 Partition 1 Partition 2 Partition 3 Kafka Movement Topic Partition 0 Consumer Group 1 Partition 1 Partition 2 Partition 3 Kafka Movement Topic Partition 0 Consumer Group 1 Partition 1 Partition 2 Partition 3 Consumer 1 Consumer 2 Consumer 3 Consumer 4 Consumer 1 Consumer 2 Consumer 3 Consumer 4 Consumer 5 Consumer 1 Consumer 2 2 Consumers / each get messages from 2 partitions 1 Consumer / get messages from all partitions 5 Consumers / one gets no messages 4 Consumers / each get messages from 1 partition
  • 25. Kafka Consumer - Partition offsets Offset – A sequential id number assigned to messages in the partitions. Uniquely identifies a message within a partition. • Consumers track their pointers via (offset, partition, topic) tuples • Kafka 0.10: seek to offset by given timestamp using method KafkaConsumer#offsetsForTimes Consumer Group A Consumer Group B 1 2 3 4 5 6 Consumer at “earliest” offset Consumer at “latest” offset New data from Producer Consumer at specific offset
  • 26. Kafka Consumer - Java API Constructing a Kafka Consumer private Properties kafkaProps = new Properties(); kafkaProps.put("bootstrap.servers","broker1:9092,broker2:9092"); kafkaProps.put("group.id","MovementsConsumerGroup"); kafkaProps.put("key.serializer", "...StringSerializer"); kafkaProps.put("value.serializer", "...StringSerializer"); consumer = new KafkaConsumer<String, String>(kafkaProps);
  • 27. Kafka Consumer - Java API Kafka Consumer Poll Loop (with synchronous offset commit) consumer.subscribe(Collections.singletonList("topic")); try { while (true) { ConsumerRecords<String, String> records = consumer.poll(100); for (ConsumerRecord<String, String> record : records) { // process message, available information: // record.topic(), record.partition(), record.offset(), // record.key(), record.value()); } consumer.commitSync(); } } finally { consumer.close(); }
  • 28. Data Retention – 3 options 1. Never: 2. Time based (TTL): log.retention.{ms | minutes | hours} 3. Size based: log.retention.bytes 4. Log compaction based (entries with same key are removed): kafka-topics.sh --zookeeper zk:2181 --create --topic customers --replication-factor 1 --partitions 1 --config cleanup.policy=compact
  • 29. Data Retention - Log Compaction ensures that Kafka always retain at least the last known value for each message key within a single topic partition compaction is done in the background by periodically recopying log segments. 0 1 2 3 4 5 6 7 8 9 10 11 K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 K2 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 Offset Key Value 3 4 6 8 9 10 K1 K3 K4 K5 K2 K6 V4 V5 V7 V9 V10 V11 Offset Key Value Compaction
  • 30. Apache Kafka – Some numbers Kafka at LinkedIn => over 1800+ broker machines / 79K+ Topics Kafka Performance at our own infrastructure => 6 brokers (VM) / 1 cluster • 445’622 messages/second • 31 MB / second • 3.0405 ms average latency between producer / consumer 1.3 Trillion messages per day 330 Terabytes in/day 1.2 Petabytes out/day Peak load for a single cluster 2 million messages/sec 4.7 Gigabits/sec inbound 15 Gigabits/sec outbound http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines https://engineering.linkedin.com/kafka/running-kafka-scale
  • 31. Inspecting the current state of a topic Use the --describe option • Leader: brokerID of the currently elected leader broker • Replica ID’s = broker ID’s • ISR = "in-sync replica", replicas that are in sync with the leader. In this example: • Broker 0 is leader for partition 1. • Broker 1 is leader for partitions 0 and 2. • All replicas are in-sync with their respective leader partitions. $ kafka-topics.sh –zookeeper zk1:2181 –-describe --topic my.topic Topic:zerg2.hydra PartitionCount:3 ReplicationFactor:2 Configs: Topic: my.topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0 Topic: my.topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1 Topic: my.topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
  • 32. Kafka Connect & Kafka Streams
  • 33. Kafka Connect - Overview Source: Confluent
  • 34. Kafka Connect - Confluent Connectors developed, tested, documented and are fully supported by Confluent Source: http://www.confluent.io/product/connectors
  • 35. Kafka Connect - Certified Connectors Source: http://www.confluent.io/product/connectors • developed by vendors utilizing the Kafka Connect framework • meet criteria for code development best practices, schema registry integration, security, and documentation
  • 36. Kafka Connect – Community Connectors Source: http://www.confluent.io/product/connectors • Other notable Connectors that have been developed utilizing the Kafka Connect framework
  • 37. Kafka Connect – Twitter example ./connect-standalone.sh ../demo-config/connect-simple-source-standalone.properties ../demo-config/twitter-source.properties name=twitter-source connector.class=com.eneco.trading.kafka.connect.twitter.TwitterSourceConnector tasks.max=1 topic=tweets twitter.consumerkey=<consumer-key> twitter.consumersecret=<consumer-secret> twitter.token=<token> twitter.secret=<token-secret> track.terms=bigdata bootstrap.servers=localhost:9095,localhost:9096,localhost:9097 key.converter=org.apache.kafka.connect.storage.StringConverter value.converter=org.apache.kafka.connect.storage.StringConverter ...
  • 38. Kafka Streams - Overview • Designed as a simple and lightweight library in Apache Kafka • no external dependencies on systems other than Apache Kafka • Part of open source Apache Kafka, introduced in 0.10+ • Leverages Kafka as its internal messaging layer • agnostic to resource management and configuration tools • Supports fault-tolerant local state • Event-at-a-time processing (not microbatch) with millisecond latency • Windowing with out-of-order data using a Google DataFlow-like model
  • 39. Streams API in the context of Kafka Source: Confluent
  • 40. Kafka and "Big Data" / "Fast Data" Ecosystem
  • 41. Kafka and the Big Data / Fast Data ecosystem Kafka integrates with many popular products / frameworks • Apache Spark Streaming • Apache Flink • Apache Storm • Apache Apex • Apache NiFi • StreamSets • Oracle Stream Analytics • Oracle Service Bus • Oracle GoldenGate • Oracle Event Hub Cloud Service • Debezium CDC • … Additional Info: https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
  • 42. Kafka in Enterprise Architecture
  • 43. Hadoop Clusterd Hadoop Cluster Big Data Cluster Traditional Big Data Architecture BI Tools Enterprise Data Warehouse Billing & Ordering CRM / Profile Marketing Campaigns File Import / SQL Import SQL Search / Explore Online & Mobile Apps Search NoSQL Parallel Batch Processing Distributed Filesystem • Machine Learning • Graph Algorithms • Natural Language Processing
  • 44. Event Hub Event Hub Hadoop Clusterd Hadoop Cluster Big Data Cluster Event Hub – handle event stream data BI Tools Enterprise Data Warehouse Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Event Hub Call Center Weather Data Mobile Apps SQL Search / Explore Online & Mobile Apps Search Data Flow NoSQL Parallel Batch Processing Distributed Filesystem • Machine Learning • Graph Algorithms • Natural Language Processing
  • 45. Hadoop Clusterd Hadoop Cluster Big Data Cluster Event Hub – taking Velocity into account Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Batch Analytics Streaming Analytics Results Parallel Batch Processing Distributed Filesystem Stream Analytics NoSQL Reference / Models SQL Search Dashboard BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data Event Hub Event Hub Event Hub
  • 46. Container Hadoop Clusterd Hadoop Cluster Big Data Cluster Event Hub – Asynchronous Microservice Architecture Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Parallel Batch ProcessingDistributed Filesystem Microservice NoSQLRDBMS SQL Search BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data { } API Event Hub Event Hub Event Hub
  • 48. Summary • Kafka can scale to millions of messages per second, and more • Provides the central backbone for your events/messages • Easy to start with (Proof of Concept) • Vibrant community and ecosystem • Fast paced technology • Confluent provides Confluent Platform with Support • Confluent Cloud provides a managed Kafka service on AWS (later Google & Azure) • Oracle Event Hub Service offers a Kafka Managed Service