We share our experience with Apache Kafka for event-driven collaboration in a microservices-based architecture. The talk was part of a Meetup: https://www.meetup.com/de-DE/Apache-Kafka-Germany-Munich/events/236402498/
6. 6
Orchestration vs. Choreography – Business Process
Apache Kafka Lessons Learned @ PAYBACK
Sam Newman 2015, Building Microservices, O'Reilly
7. 7
Orchestration: Synchronous
Sam Newman 2015, Building Microservices, O'Reilly
PRO
• Easy to map code to business process
• Immediate Feedback about every stage
• Atomic Execution
CON
• Customer Service becomes central place of logic
• Leads to "God" Services
• Tight coupling, high cost of changes
• Resilience is complex (think retries, scaling…)
8. 8
Choreography: Asynchronous, Event-driven
*Sam Newman 2015, Building Microservices, O'Reilly
PRO
• Easier to achieve Resilience and Performance
• More decoupled
• Distributed logic
• Higher flexibility (changes, scaling)
CON
• Higher implementation effort & complexity
• Additional work for monitoring and tracking
• Additional SPOF
9. 9
Resilience concerns the whole system
Loose coupling helps implement resilience patterns, but you still need to take care of:
○ delivery and processing semantics
○ retries and fallback strategy
○ handle timeouts and other communication errors
○ transaction handling
○ no silver bullet pattern for all event types
Resilience is the ability to fully recover from failure: to self-heal
10. 10
Choosing the right tool
○ NFRs may be specific to the Event type
○ Delivery semantics depend on the Event type
- at most once
- at least once
- exactly once
○ Event order can be important for some use cases (FIFO)
○ Reprocessing must be possible
○ Monitoring and alerting must be well supported (APIs) due to the increased complexity
○ …
We need to consider
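The difference between "at most once" and "at least once" comes down largely to whether the consumer records its position before or after it processes an event. The following is a minimal, Kafka-free simulation in plain Java; the class, method, and parameter names are hypothetical, and a crash is modelled as an exception followed by a restart from the last committed position. "Exactly once" is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a consumer that crashes once while handling one event (illustrative names only). */
final class DeliverySemanticsDemo {

    /**
     * Consumes all events, restarting from the last committed position after a crash.
     * commitBeforeProcessing = true  -> at most once  (the crashed event is lost)
     * commitBeforeProcessing = false -> at least once (the crashed event is seen twice)
     */
    static List<String> run(List<String> events, int crashAtIndex, boolean commitBeforeProcessing) {
        List<String> sideEffects = new ArrayList<>(); // what downstream systems observed
        int committed = 0;                            // last committed position
        boolean crashed = false;                      // the crash happens only once

        while (committed < events.size()) {
            int position = committed;                 // a restart resumes at the committed position
            try {
                while (position < events.size()) {
                    boolean crashNow = !crashed && position == crashAtIndex;
                    if (commitBeforeProcessing) {
                        committed = position + 1;                 // commit first ...
                        if (crashNow) {
                            crashed = true;
                            throw new IllegalStateException("crash before processing");
                        }
                        sideEffects.add(events.get(position));    // ... then process
                    } else {
                        sideEffects.add(events.get(position));    // process first ...
                        if (crashNow) {
                            crashed = true;
                            throw new IllegalStateException("crash before commit");
                        }
                        committed = position + 1;                 // ... then commit
                    }
                    position++;
                }
            } catch (IllegalStateException simulatedCrash) {
                // Simulated restart: the outer loop resumes from `committed`.
            }
        }
        return sideEffects;
    }

    public static void main(String[] args) {
        List<String> events = List.of("a", "b", "c");
        System.out.println(run(events, 1, true));  // [a, c]
        System.out.println(run(events, 1, false)); // [a, b, b, c]
    }
}
```

In the at-least-once run the duplicate "b" is exactly the case that idempotent event processing has to absorb.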
12. 12
I have a joke about an event…
…But you might not get it
INCIDENTS
13. 13
Cluster outage
- VMs stalled during snapshot backups, leading to cluster reconnects
- In 9/10 cases recovery worked
- In 1/10 cases this led to a single broker outside the cluster that still had partitions assigned (luckily it refused writes because of missing replicas)
Deactivate Backups!
Consider physical machines!
14. 14
"A first sign of the beginning of
understanding is the wish to die."
Franz Kafka
By Atelier Jacobi: Sigismund Jacobi (1860–1935) - http://www.bodleian.ox.ac.uk/news/2008_july_02, Public domain, https://commons.wikimedia.org/w/index.php?curid=5428566
16. 16
Producer
Implementation
○ The producer uses a non-blocking async API
○ Two options for checking for failures:
- Block immediately for the response: send().get()
- Do follow-up work in a Callback
- Be careful about handling failures
○ Don’t forget to close the producer! producer.close() blocks until in-flight requests complete
Configuration
○ acks – set to all
○ batch.size – set to 0
○ retries (defaults to 0) – think about increasing this value
- Not all errors are automatically retriable. Think about custom error handling on the producer side!
- Retries may affect message ordering
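As a rough illustration of the configuration bullets above, the settings could be collected in a plain java.util.Properties object. This is only a sketch: the class and method names, the broker address, and the retry count are assumptions, and the code deliberately avoids the Kafka client library so that it runs on its own. The keys themselves (bootstrap.servers, acks, batch.size, retries) are the producer configuration names.

```java
import java.util.Properties;

/** Builds the producer settings named on the slide (class and method names are illustrative). */
final class ProducerSettingsSketch {

    static Properties slideSettings(String bootstrapServers, int retries) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("acks", "all");        // wait for all in-sync replicas
        props.setProperty("batch.size", "0");    // disable batching, as recommended on the slide
        props.setProperty("retries", Integer.toString(retries)); // retries can reorder messages
        // A real producer additionally needs key.serializer and value.serializer
        // before the properties can be passed to new KafkaProducer<>(props).
        return props;
    }

    public static void main(String[] args) {
        // Host name and retry count are placeholders, not values from the talk.
        Properties props = slideSettings("localhost:9092", 3);
        props.stringPropertyNames().stream().sorted()
             .forEach(name -> System.out.println(name + "=" + props.getProperty(name)));
    }
}
```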
17. 17
Consumer
Recommendations
○ Note: the consumer is single-threaded – one consumer per thread
○ Disable auto commit (enable.auto.commit = false)
○ Commit explicit offsets using OffsetAndMetadata instead of committing everything
○ Rollback with seek -> you need to know your last committed message -> implement a Rebalance Listener
○ Rollback (seek) after errors in offset commit
○ Change the default max.partition.fetch.bytes (1 MB can lead to session timeouts in < 0.10.X)
○ Event processing should be idempotent – be prepared to handle duplicates
○ Think about event reprocessing (how to change the offset, how to recreate an event, etc.)
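One common way to make event processing idempotent is to remember which events have already been handled and skip repeats. The sketch below assumes that every event carries a unique ID, which the slides do not state, and keeps the processed IDs in memory only; all names are hypothetical. A production version would need durable storage for the IDs, ideally updated together with the business side effect.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Skips events whose ID has already been processed (in-memory sketch, illustrative names). */
final class IdempotentHandler {

    private final Set<String> processedIds = new HashSet<>(); // assumption: lost on restart
    private final Consumer<String> businessLogic;

    IdempotentHandler(Consumer<String> businessLogic) {
        this.businessLogic = businessLogic;
    }

    /** Returns true if the event was processed, false if it was skipped as a duplicate. */
    boolean handle(String eventId, String payload) {
        if (processedIds.contains(eventId)) {
            return false;                 // duplicate delivery: do nothing
        }
        businessLogic.accept(payload);    // may throw; the ID is then not recorded
        processedIds.add(eventId);        // record only after successful processing
        return true;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler(p -> System.out.println("processing " + p));
        System.out.println(handler.handle("e1", "member changed")); // true
        System.out.println(handler.handle("e1", "member changed")); // false
    }
}
```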
18. 18
Other basic configuration
Be Safe, Not Sorry
○ acks = all
○ block.on.buffer.full = true
○ Producer retries = MAX_INT
○ ( max.in.flight.requests.per.connection = 1 )
○ producer.close()
○ Replication factor >= 3
○ min.insync.replicas = 2
○ unclean.leader.election.enable = false
○ enable.auto.commit = false
○ Commit after processing
○ Monitor!
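The combination of a replication factor of 3, min.insync.replicas = 2 and acks = all can be read as simple arithmetic: writes are accepted only while at least min.insync.replicas replicas are in sync, so one replica may fail without blocking producers. A small sketch of that reasoning follows; the names are hypothetical, and it makes the simplifying assumption that every replica that has not failed is in sync.

```java
/** Back-of-the-envelope check of write availability under acks=all (illustrative names). */
final class InSyncReplicaMath {

    /** True while enough replicas remain in sync for the leader to accept acks=all writes. */
    static boolean acceptsWrites(int replicationFactor, int minInsyncReplicas, int failedReplicas) {
        int inSync = Math.max(0, replicationFactor - failedReplicas); // assumes live == in sync
        return inSync >= minInsyncReplicas;
    }

    /** Number of replicas that may fail before acks=all writes are refused. */
    static int tolerableFailures(int replicationFactor, int minInsyncReplicas) {
        return Math.max(0, replicationFactor - minInsyncReplicas);
    }

    public static void main(String[] args) {
        System.out.println(acceptsWrites(3, 2, 1));  // true
        System.out.println(acceptsWrites(3, 2, 2));  // false
        System.out.println(tolerableFailures(3, 2)); // 1
    }
}
```

The second case corresponds to the behaviour described in the cluster outage incident, where an isolated broker refused writes because replicas were missing.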
21. 21
Kafka-Manager: Open Source UI/API Kafka Mgmt Tool
https://github.com/yahoo/kafka-manager
• Good for current cluster status and ad-hoc analysis
• Provides a status API (HTTP)
• Consumers only displayed during active consumption
• 0.10.x support still not merged
23. 23
Burrow: API only Consumer Lag Checking
{
error: false,
message: "consumer group status returned",
status: {
cluster: "vp2",
group: "mdeAppGroup",
status: "ERR",
complete: false,
partitions: [
{
topic: "memberDataChanges",
partition: 1,
status: "STOP",
start: {
offset: 1775109,
timestamp: 1485253978439,
lag: 0
},
end: {
offset: 1775127,
timestamp: 1485254054861,
lag: 1
}
},
[…]
],
totallag: 8
},
request:
{url: "/v2/kafka/vp2/consumer/mdeAppGroup/lag",
host: "hqiqlpxxap89",
cluster: "vp2",
group: "mdeAppGroup",
topic: ""
}
}
curl -XGET http://burrow/v2/kafka/vp2/consumer/mdeAppGroup/lag
No Thresholds required
Alerting via email and HTTP POST
Issue: Calculate lag at request time, not commit time
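For readers unfamiliar with the term, consumer lag for a partition is the distance between the newest offset in the log and the offset the consumer group has committed, and the total lag is the sum over all partitions. The sketch below shows only that arithmetic; it does not reproduce how Burrow evaluates lag, the names are hypothetical, and treating a partition without a commit as offset 0 is an assumption made for the example.

```java
import java.util.Map;

/** Plain lag arithmetic: log end offset minus committed offset (illustrative names). */
final class ConsumerLagSketch {

    static long partitionLag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Sums the lag over all partitions; a missing commit counts as offset 0 (assumption). */
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0L;
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += partitionLag(entry.getValue(), committed);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up offsets for two partitions, not taken from the response above.
        Map<Integer, Long> logEnd = Map.of(0, 10L, 1, 20L);
        Map<Integer, Long> committed = Map.of(0, 7L, 1, 20L);
        System.out.println(totalLag(logEnd, committed)); // 3
    }
}
```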
24. 24
"God gives the nuts, but he does not crack them."
Franz Kafka
PAYBACK GmbH
Maxim Schelest
Thomas Falkenberg
Theresienhöhe 12
80339 München
Phone +49 (0) 89 997 41 – 0
PAYBACK.net | PAYBACK.de