This talk is aimed at developers who are interested in scaling their streaming applications with Exactly-Once (EOS) guarantees. Since its original release, EOS processing has seen wide adoption in the community as a much-needed feature, and it has also exposed various scalability and usability issues when applied in production systems.
To address those issues, we improved the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will take a deep dive into the newly added features (KIP-447), giving the audience more insight into the tradeoffs between scalability and semantic guarantees, and into how Kafka Streams specifically leverages them to help scale EOS streaming applications written in this library. We will also present how the EOS code can be simplified with a plain Producer and Consumer. Come learn more if you wish to adopt this improved EOS feature and get started on building your own EOS application today!
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and Scalability (Boyang Chen & Guozhang Wang, Confluent), Kafka Summit 2020
13.
The Kafka Approach for Exactly Once
1) Idempotent writes, in order, within a single topic partition
2) Transactional writes across multiple output topic partitions
3) Guarantee a single writer for any input topic partition
[KIP-98, KIP-129]
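Point 1 above can be illustrated with a minimal sketch of the dedup bookkeeping a broker performs for idempotent writes. This is a toy model, not the real broker code: the class and method names are illustrative, and it only tracks the last sequence number per producer for one partition.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of idempotent writes: the broker tracks the last sequence
// number per producerId for a partition, rejects duplicates, and refuses
// out-of-order batches, so retries cannot duplicate or reorder records
// within a single topic partition.
class IdempotentPartitionLog {
    private final Map<Long, Integer> lastSeqByProducer = new HashMap<>();

    /** Returns true if the batch is appended, false if it is a duplicate. */
    boolean append(long producerId, int sequence) {
        int lastSeq = lastSeqByProducer.getOrDefault(producerId, -1);
        if (sequence <= lastSeq) {
            return false;                    // retry of an appended batch
        }
        if (sequence != lastSeq + 1) {
            throw new IllegalStateException("out-of-order sequence");
        }
        lastSeqByProducer.put(producerId, sequence);
        return true;
    }
}
```

A retried batch carries the same sequence number as the original, so the second append of sequence 1 is simply dropped rather than written twice.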
47.
When Taking Over the Partition:
1) The previous transaction must have completed (commit or abort), so there are no concurrent transactions.
2) Other clients are fenced from writing processing results for those input partitions, i.e. we have a "single writer".
48.
Transactional ID: defines the single-writer scope
1) Configured by the unique producer `transactional.id` property.
2) Fencing is enforced by a monotonically increasing epoch for each id.
3) Producer initialization awaits pending transaction completion.
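The epoch fencing in point 2 can be sketched with a toy transaction coordinator (plain Java, not the actual coordinator code; names are illustrative): every producer initialization bumps the epoch for its `transactional.id`, so any commit carrying an older epoch must come from a zombie and is rejected.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of transactional.id fencing: initTransactions() bumps the
// epoch for the id; a commit is only accepted with the latest epoch, so a
// zombie producer holding a stale epoch is fenced.
class TransactionCoordinator {
    private final Map<String, Integer> epochs = new HashMap<>();

    /** Called on producer initialization; returns the new epoch. */
    int initTransactions(String transactionalId) {
        int epoch = epochs.getOrDefault(transactionalId, 0) + 1;
        epochs.put(transactionalId, epoch);
        return epoch;
    }

    /** A commit is only valid if it carries the latest epoch. */
    boolean commit(String transactionalId, int epoch) {
        return epoch == epochs.getOrDefault(transactionalId, 0);
    }
}
```

When a restarted instance re-initializes with the same id, the old instance's epoch becomes stale, and its in-flight commit fails.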
57.
Consumer Group
txn.id = A, epoch = 1
txn.id = B, epoch = 2
Number of producer transactional IDs ~= number of input partitions
Producers need to be created dynamically upon rebalance
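The pre-KIP-447 producer-per-partition model above can be sketched as follows (a toy pool, with producers modeled as their `transactional.id` strings; the `app-txn-<partition>` naming is an illustrative assumption, not a Kafka convention): deriving the id from the input partition keeps it stable across rebalances, at the cost of one producer per partition.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the producer-per-input-partition pattern: on each
// rebalance, close producers for revoked partitions and create producers
// for newly assigned ones, with a partition-derived transactional.id.
class ProducerPerPartitionPool {
    private final Map<Integer, String> producers = new HashMap<>();

    void onPartitionsAssigned(Set<Integer> partitions) {
        producers.keySet().retainAll(partitions);     // "close" revoked ones
        for (int p : partitions) {
            producers.putIfAbsent(p, "app-txn-" + p); // create on demand
        }
    }

    Set<String> transactionalIds() {
        return new HashSet<>(producers.values());
    }
}
```

This is exactly the scalability pain KIP-447 removes: the number of producers grows with the number of input partitions.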
65.
What problems is KIP-447 solving?
● Make the one-producer-per-process model work
● Unblock technical challenges
○ Commit fencing
○ Concurrent transactions
87.
What problems is KIP-447 solving?
● Commit fencing
● Concurrent transactions
● We are fencing on the transactional producer side, which assumes a static partition assignment
● Consumer group partition assignments are dynamic in practice
● Action: fence zombie producer commits
○ Different from epoch fencing
○ Utilize the consumer group generation ~= epoch
● Add new APIs
○ Expose the group generation through consumer#groupMetadata()
○ Commit the transaction with consumer metadata through producer#sendOffsetsToTransaction(offsets, groupMetadata)
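The generation-based fencing described on this slide can be sketched with a toy group coordinator (plain Java, not the real coordinator; names are illustrative): each rebalance bumps the group generation, the offset commit carries the generation obtained from consumer#groupMetadata(), and a commit from a previous generation is rejected even though the partition assignment itself is dynamic.

```java
// Toy model of KIP-447 commit fencing: the transactional offset commit is
// validated against the current consumer group generation, so a zombie
// that missed a rebalance cannot commit, regardless of which partitions
// it used to own.
class GroupCoordinator {
    private int generation = 0;

    /** Each rebalance produces a new generation; returns it. */
    int rebalance() {
        return ++generation;
    }

    /** Offset commits carrying a stale generation are rejected. */
    boolean commitOffsets(int senderGeneration) {
        return senderGeneration == generation;
    }
}
```

This is the key difference from `transactional.id` epoch fencing: the fence is keyed on group membership rather than on a statically configured producer id.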
93.
What problems is KIP-447 solving?
● Commit fencing
● Concurrent transactions
● Only one open transaction is allowed for each input partition
● The offset commit is the only critical section
○ Observed: pending offsets could be completed later, causing duplicate processing
○ Observed: the consumer always needs to fetch offsets after a rebalance
○ Action: the OffsetFetchRequest will back off until pending commits are cleared, either by the previous transaction completing or by timeout
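The back-off behavior above can be sketched with a toy offset store (plain Java, not broker code; names are illustrative): while a transactional offset commit is still pending, an offset fetch gets a retriable "pending" answer instead of a possibly stale committed offset, and only after the previous transaction completes does the fetch return a value.

```java
// Toy model of the KIP-447 concurrent-transaction fix: an offset fetch
// for a partition with a pending transactional commit signals "retry
// later" (modeled as null here), so the consumer backs off rather than
// reprocessing from a stale offset.
class OffsetStore {
    private Long committed = null;  // last committed offset, if any
    private Long pending = null;    // offset staged by an open transaction

    void stageCommit(long offset) {
        pending = offset;
    }

    void completeTransaction(boolean commit) {
        if (commit && pending != null) {
            committed = pending;
        }
        pending = null;             // abort simply discards the staged offset
    }

    /** Returns null to mean "pending, back off and retry". */
    Long fetchOffset() {
        return pending != null ? null : committed;
    }
}
```

Making the fetch wait out the pending commit is what removes the "pending offsets completed later" duplicate-processing window.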
103.
[Chart: growth of the number of producers as a function of the number of input partitions and the number of applications, comparing At Least Once, Exactly Once, and Exactly Once after KIP-447]
105.
447 Summary
● Resolve the semantic mismatch between producer and consumer
○ Commit fencing
○ Concurrent transactions
● Make one producer per processing unit possible
● Next we will talk about our scale-testing results, and how to turn on 447 for Kafka Streams, a major EOS adopter
116.
Upgrade Procedure
● Rolling-bounce brokers to >= Apache Kafka 2.5
● Upgrade the Streams application binary and keep the PROCESSING_GUARANTEE setting at "exactly_once". Do the first rolling bounce, and make sure the group is stable with every instance on the 2.6 binary.
● Change the PROCESSING_GUARANTEE setting to "exactly_once_beta" and do a second rolling bounce to start using the new thread producer for EOS.
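For reference, the constant PROCESSING_GUARANTEE corresponds to the Streams config key `processing.guarantee`, so the only config change between the two bounces is this line (a sketch of the final state, assuming brokers are already on 2.5+ and the application binary on 2.6):

```properties
# Kafka Streams configuration after the second rolling bounce
# (brokers >= Apache Kafka 2.5, application binaries on 2.6)
processing.guarantee=exactly_once_beta
```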
121.
Kafka has an elegant transaction model which has held up well.
● Addressed usability pain by fixing the producer/consumer semantic mismatch
● Achieved a breakthrough on scalability
● Showed how to opt in to the new EOS model in Kafka Streams