1
When it absolutely, positively,
has to be there
Reliability Guarantees in Apache Kafka
@jeffholoman @gwenshap
4
Apache Kafka
High Throughput
Low Latency
Scalable
Centralized
Real-time
5
Streaming Platform
Producers, consumers, streaming applications, and connectors, all built around Apache Kafka.
6
Versions of Apache Kafka
• 0.7.0 <- Please don’t
• 0.8.0 <- Replication exists, it will continue evolving with every release
• 0.8.2 <- New producer, offset commits to Kafka
• 0.9.0 <- New consumer, Connect APIs
• 0.10.0 <- New consumer improvements, Streams APIs
• 0.11.0 <- Idempotent producer, transactional semantics, Exactly once.
7
Kafka Components
• Broker
• Java clients:
• Producer
• Consumers
• Kafka Streams
• Kafka Connect
• Non-Java:
• librdkafka
• librdkafka-based – Python, Go, Node.js, C#...
• Others
8
If Kafka is a critical piece of our pipeline
 Can we be 100% sure that our data will get there?
 Can we lose messages?
 How do we verify?
 Whose fault is it?
10
Distributed Systems
 Things fail
 Systems are designed to tolerate failure
 We must expect failures and design our code and configure our systems to handle them
11
Network, Client Machine, Broker Machine – Data Flow
[Diagram: produce-side data flow. A message travels from the application thread through the Kafka client, the client machine's O/S socket buffer and NIC, across the network to the broker's NIC, O/S socket buffer, page cache, and disk, and on to replication; an ack or exception returns to the client via the callback. Failure points (✗) exist at every hop.]
12
Client Machine, Broker Machine – Data Flow
[Diagram: consume-side data flow. Data flows from the broker's disk and page cache through its O/S socket buffer and NIC, across the network to the client machine's NIC, O/S socket buffer, the Kafka client, and the application thread; consumed offsets are stored in ZooKeeper or in Kafka. Failure points (✗) exist at every hop.]
13
Kafka is super reliable.
Stores data, on disk. Replicated.
… if you know how to configure it that way.
14
Replication is your friend
 Kafka protects against failures by replicating data
 The unit of replication is the partition
 One replica is designated as the Leader
 Follower replicas fetch data from the leader
 The leader holds the list of “in-sync” replicas
15
Replication and ISRs
Topic: my_topic, Partitions: 3, Replicas: 3
A producer writes to partitions spread across brokers 100, 101, and 102:
• Partition 0 – Leader: 100, ISR: 101,102
• Partition 1 – Leader: 101, ISR: 100,102
• Partition 2 – Leader: 102, ISR: 101,100
16
ISR
Two things make a replica in-sync:
• Lag behind the leader
• replica.lag.time.max.ms – replica that didn’t fetch or is behind
• replica.lag.max.messages – has gone away in 0.9
• Connection to ZooKeeper
17
Terminology
Acked
• Producers will not retry sending.
• Depends on producer setting.
Committed
• Only when the message got to all ISRs (future leaders have it).
• Consumers can read it.
• replica.lag.time.max.ms controls how long a dead replica can prevent consumers from reading.
Committed Offsets
• The consumer told Kafka the latest offsets it read. By default the consumer will not see these events again.
18
Replication
Acks = all
• Waits for all in-sync replicas to reply.
[Diagram: Replicas 1, 2, and 3 all have message 100.]
19
Replication
Replica 3 stopped replicating for some reason
[Diagram: Replica 3 has only message 100; Replicas 1 and 2 have 100 and 101. Message 100 is acked under acks = all and is “committed”; message 101 is acked under acks = 1 but is not “committed”.]
20
Replication
One replica drops out of the ISR, or goes offline
[Diagram: Replica 3 still has only message 100; Replicas 1 and 2 have 100 and 101.]
All messages are now acked and committed
21
Replication
2nd replica drops out, or is offline
[Diagram: Replica 3 has 100; Replica 2 has 100 and 101; Replica 1, the only remaining replica, has 100–104.]
22
Replication
Now we’re in trouble
[Diagram: Replica 1, the last in-sync replica holding messages 100–104, fails (✗).]
23
Replication
If Replica 2 or 3 comes back online before the leader, you will lose data.
[Diagram: Replicas 2 and 3 are missing messages 102–104, all of which were “acked” and “committed” on Replica 1.]
24
So what to do
Disable Unclean Leader Election
• unclean.leader.election.enable = false
• Default from 0.11.0
Set replication factor
• default.replication.factor = 3
Set minimum ISRs
• min.insync.replicas = 2
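As a sketch, these three settings go in the broker’s server.properties (values as discussed above; adjust for your cluster):
# server.properties – broker-side defaults (a sketch)
# Disable unclean leader election (the default from 0.11.0 on)
unclean.leader.election.enable=false
# Replication factor for automatically created topics
default.replication.factor=3
# Minimum in-sync replicas required to accept acks=all writes
min.insync.replicas=2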
25
Warning
min.insync.replicas is applied at the topic level.
If the topic was created before the server-level change, you must alter the topic configuration manually.
Before 0.9.0 you must always alter the topic manually (KAFKA-2114).
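For an existing topic, the per-topic override can be set with kafka-configs.sh (0.9+); the topic name and ZooKeeper address here are placeholders:
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name my_topic \
  --add-config min.insync.replicas=2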
26
Replication
Replication factor = 3, Min ISR = 2
[Diagram: Replicas 1, 2, and 3 all have message 100.]
27
Replication
One replica drops out of the ISR, or goes offline
[Diagram: Replica 3 still has only message 100; Replicas 1 and 2 have 100 and 101.]
28
Replication
2nd replica fails out, or is out of sync
[Diagram: Replica 3 has 100; Replicas 1 and 2 have 100 and 101. Messages 102–104 stay buffered in the producer because fewer than min.insync.replicas replicas are in sync.]
29
30
Producer Internals
The producer sends batches of messages to a buffer.
[Diagram: application threads call send(); messages (M0–M3) are grouped into batches (Batch 1–3) in the producer’s buffer and drained to the broker. On success the response updates the Future and fires the callback with metadata; on failure it either triggers a retry back into the batch queue or returns an exception.]
31
Basics
Durability can be configured with the producer configuration request.required.acks
• 0 – the message is written to the network (buffer)
• 1 – the message is written to the leader
• all – the producer gets an ack after all ISRs receive the data; the message is committed
Make sure the producer doesn’t just throw messages away!
• block.on.buffer.full = true (< 0.9.0)
• max.block.ms = Long.MAX_VALUE
• Or handle the BufferExhaustedException / TimeoutException yourself (see the sketch below)
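A minimal Java sketch of the 0.9+ options, reusing the my_topic topic from earlier slides; the exception handling is illustrative, not prescriptive:
props.put("acks", "all");
// Option A: never drop – block send() indefinitely when the buffer is full
props.put("max.block.ms", String.valueOf(Long.MAX_VALUE));
// Option B: bound the wait and handle the failure yourself
props.put("max.block.ms", "60000");
try {
  producer.send(new ProducerRecord<>("my_topic", "key", "value"));
} catch (org.apache.kafka.common.errors.TimeoutException e) {
  // buffer stayed full for max.block.ms – log, alert, or spool and retry later
}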
32
“New” Producer
All calls are non-blocking and async
Two options for checking for failures (sketched below):
• Immediately block for the response: send().get()
• Do follow-up work in a Callback, close the producer after an error threshold
• Be careful about buffering these failures. Future work? KAFKA-1955
• Don’t forget to close the producer! producer.close() will block until in-flight requests complete
retries (producer config) defaults to 0
In-flight requests could lead to message re-ordering
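Both options as Java sketches, assuming a configured KafkaProducer<String, String> producer as on the previous slide (imports from org.apache.kafka.clients.producer and java.util.concurrent omitted, as in the other slides):
// Option 1: block immediately for the broker response
try {
  RecordMetadata md = producer.send(new ProducerRecord<>("my_topic", "key", "value")).get();
} catch (InterruptedException | ExecutionException e) {
  // retries exhausted or a non-retriable error – decide whether to halt or skip
}

// Option 2: do follow-up work asynchronously in a Callback
producer.send(new ProducerRecord<>("my_topic", "key", "value"), new Callback() {
  public void onCompletion(RecordMetadata metadata, Exception exception) {
    if (exception != null) {
      // count the failure; close the producer once an error threshold is crossed
    }
  }
});

producer.close(); // blocks until in-flight requests complete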
33
34
Consumer
Three choices for Consumer API
• Simple Consumer
• High Level Consumer (ZookeeperConsumer)
• New KafkaConsumer
35
New Consumer – auto commit
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "10000");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
processAndUpdateDB(record);
}
}
Commit automatically every 10 seconds – what if we crash after 8 seconds?
36
New Consumer – manual commit
props.put("enable.auto.commit", "false");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records)
processAndUpdateDB(record);
consumer.commitSync();
}
Commit the entire batch outside the loop!
43
Minimize Duplicates for At Least Once Consuming
1. Commit your own offsets – set enable.auto.commit = false
2. Use a RebalanceListener to limit duplicates (see the sketch below)
3. Make sure you commit only what you are done processing
4. Note: the new consumer is single-threaded – use one consumer per thread.
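A sketch of points 1–3 with the new consumer, reusing processAndUpdateDB() and the manually-committing consumer from the earlier slides, and committing only the offsets of records we have finished processing when partitions are revoked:
final Map<TopicPartition, OffsetAndMetadata> processed = new HashMap<>();
consumer.subscribe(Arrays.asList("foo", "bar"), new ConsumerRebalanceListener() {
  public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    consumer.commitSync(processed); // commit before the partitions move elsewhere
  }
  public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
});
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(100);
  for (ConsumerRecord<String, String> record : records) {
    processAndUpdateDB(record);
    processed.put(new TopicPartition(record.topic(), record.partition()),
                  new OffsetAndMetadata(record.offset() + 1)); // next offset to read
  }
  consumer.commitSync(processed);
}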
44
Exactly Once Semantics
At most once is easy
At least once is not bad either – commit after you are 100% sure the data is safe
Exactly once is tricky
• Commit data and offsets in one transaction (see the sketch below)
• Idempotent producer
Kafka Connect – many connectors (especially Confluent’s) are exactly once, by using an external database to write events and store offsets in one transaction
Kafka Streams – starting at 0.11.0 has easy-to-configure exactly-once (processing.guarantee = exactly_once)
Other stream processing systems – have their own thing.
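For the transactional and idempotent-producer bullets, the 0.11 producer APIs look roughly like this; the topic, transactional.id, consumer group, and offsetsToCommit map are illustrative placeholders:
props.put("enable.idempotence", "true");
props.put("transactional.id", "my-app-1");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
  producer.beginTransaction();
  producer.send(new ProducerRecord<>("output_topic", "key", "value"));
  // commit the consumed offsets in the same transaction as the produced data
  producer.sendOffsetsToTransaction(offsetsToCommit, "my-consumer-group");
  producer.commitTransaction();
} catch (Exception e) {
  producer.abortTransaction();
}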
47
How do we test Kafka?
"""Replication tests.
These tests verify that replication provides simple durability guarantees by checking that data acked by brokers is still available for consumption in the face of various failure scenarios.
Setup: 1 zk, 3 kafka nodes, 1 topic with partitions=3, replication-factor=3, and min.insync.replicas=2
- Produce messages in the background
- Consume messages in the background
- Drive broker failures (shutdown, or bounce repeatedly with kill -15 or kill -9)
- When done driving failures, stop producing, and finish consuming
- Validate that every acked message was consumed
"""
48
Monitoring for Data Loss
• Monitor for producer errors – watch the retry numbers
• Monitor consumer lag – MaxLag or via offsets
• Standard schema:
• Each message should contain a timestamp and the originating service and host
• Each producer can report message counts and offsets to a special topic
• “Monitoring consumer” reports message counts to another special topic
• “Important consumers” also report message counts
• Reconcile the results
49
Be Safe, Not Sorry
acks = all
max.block.ms = Long.MAX_VALUE
retries = Integer.MAX_VALUE
( max.in.flight.requests.per.connection = 1 )
producer.close()
replication.factor >= 3
min.insync.replicas = 2
unclean.leader.election.enable = false
enable.auto.commit = false
Commit after processing
Monitor!
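As Java client properties, the client side of this checklist looks roughly like the following (a sketch; producerProps and consumerProps are hypothetical Properties objects, and the broker-side settings shown earlier still apply):
// Producer
producerProps.put("acks", "all");
producerProps.put("retries", String.valueOf(Integer.MAX_VALUE));
producerProps.put("max.block.ms", String.valueOf(Long.MAX_VALUE));
producerProps.put("max.in.flight.requests.per.connection", "1"); // avoid re-ordering on retry
// ... and call producer.close() on shutdown

// Consumer
consumerProps.put("enable.auto.commit", "false");
// commit with commitSync() only after processing succeeds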
50
Thank You!
Editor's Notes

  1. Apache Kafka is no longer just pub-sub messaging. Because of its persistence and reliability, it makes a great place to manage general streams of events and to drive streaming applications.
  2. We are going to start by discussing reliability guarantees as implemented by the broker’s replication protocol. We then discuss how to configure the clients for better reliability. We’ll use the Java clients as an example. For non-Java clients: the C client (librdkafka) works pretty much the same way – the same configurations and guarantees will work. Same for clients in other languages based on librdkafka. For other clients… it’s hard to make generalizations. Some are very different and the advice in this talk will not work for them.
  3. Low-level diagram: not talking about producer / consumer design yet… maybe this is too low-level, though. Show a diagram of network send -> OS socket -> NIC -> ---- NIC -> OS socket buffer -> socket -> internal message flow / socket server -> response back to client -> how writes get persisted to disk, including OS buffers, async write, etc. Then overlay places where things can fail.
  4. Low-level diagram: not talking about producer / consumer design yet… maybe this is too low-level, though. Show a diagram of network send -> OS socket -> NIC -> ---- NIC -> OS socket buffer -> socket -> internal message flow / socket server -> response back to client -> how writes get persisted to disk, including OS buffers, async write, etc. Then overlay places where things can fail.
  5. Highlight boxes with different color
  6. When Replica 3 is back, it will catch up
  7. Kafka exposes its binary TCP protocol via a Java API, which is what we’ll be discussing here. Everything in the box is what’s happening inside the producer. Generally speaking, you have an application thread, or threads, that take individual messages and “send” them to Kafka. What happens under the covers is that these messages are batched up where possible in order to amortize the overhead of the send, stored in a buffer, and communicated over to Kafka. After Kafka has completed its work, a response is returned back for each message. This happens asynchronously, using Java’s concurrent API. The response is comprised of either an exception or a metadata record. If the metadata is returned, which contains the offset, partition and topic, then things are good and we continue processing. However, if an error is returned, the producer will automatically retry the failed message, up to a configurable number of times or amount of time. When this exception occurs and we have retries enabled, these retries actually just go right back to the start of the batches being prepared to send back to Kafka.
  8. Commit every 10 seconds, but we don’t really have any control over what’s processed, and this can lead to duplicates
  9. If you are doing too much work commits don’t count as heartbeat ->
  10. So let’s say we have auto-commit enabled, and we are chugging along, counting on the consumer to commit our offsets for us. This is great because we don’t have to code anything, and don’t have to think about the frequency of commits and the impact that might have on our throughput. Life is good. But now we’ve lost a thread or a process, and we don’t really know where we are in the processing, because the last auto-commit committed stuff that we hadn’t actually written to disk.
  11. So now we’re in a situation where we think we’ve read all of our data, but we will have gaps in the data. Note the same risk applies if we lose a partition or broker and get a new leader. OR