Handling failures is important in any system, and it is a must-have when a product handles sensitive data. It also becomes considerably harder in the world of microservices, since a failure can happen in any of the services and even in their dependencies.
One of the go-to solutions when handling errors is to simply retry. However, in a complex system, a retrying mechanism must be Smart, Customisable, Pluggable, and Persistent.
In this talk, we'll discuss what capabilities a good retrying mechanism should have, how I implemented a Kafka-based retrying solution that provides those capabilities and can handle a large and diverse set of errors, and why Kafka is a good match for such a solution.
4. About Me
Hello World!
Hi, my name is Asaf Halili.
I’m a software developer.
I work @ Oribi, a web analytics startup.
I’m a technology enthusiast and a hobbyist photographer.
15. What an Ideal Solution Looks Like
Smart
● Act upon failures automatically
● Allow sufficient time for failures to repair
Persistent
● In the era of cloud & orchestration platforms, applications can be volatile, so pending retries must survive an application restart.
31. Why Do We Need Another Topic?
● Decoupling
○ Separate the first retries from the second retries, and so forth. This allows us to handle each of them differently:
■ Observability
■ Retry pace
■ etc.
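To make the "retry pace" point concrete, here is a minimal sketch of how one topic per retry level lets each level carry its own delay. The topic names follow the architecture slide later in the deck. The class name `RetryPace` and the delay values are assumptions for illustration only, since the slides do not state concrete numbers.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical helper: one delay per retry topic, so each level can be paced independently.
final class RetryPace {
    // Assumed values; the slides do not give concrete delays.
    private static final Map<String, Duration> DELAY_BY_TOPIC = Map.of(
            "Retry1", Duration.ofSeconds(10),
            "Retry2", Duration.ofMinutes(5));

    // Returns how long a consumer of the given retry topic should wait before retrying.
    static Duration delayFor(String topic) {
        Duration delay = DELAY_BY_TOPIC.get(topic);
        if (delay == null) {
            throw new IllegalArgumentException("Unknown retry topic: " + topic);
        }
        return delay;
    }
}
```

Because the pace lives next to the topic name, changing how aggressively second retries run does not touch the first-retry path.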
36. But what would happen if the DB was unavailable again?
37. Saving the Method Context to a DLQ Topic
(Diagram: Kafka Producer → DLQ Kafka Topic)
Publish the method context as a Kafka message:
{
  "fullName": "Asaf Halili",
  "mathGrade": 85
}
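As a rough sketch of what "publish the method context" could mean in code, the snippet below turns a method's named arguments into the JSON payload shown on the slide. The class `MethodContextJson` is hypothetical, and the hand-rolled serializer with minimal escaping is an assumption made to keep the example dependency-free. A real implementation would more likely use a JSON library and hand the resulting string to a Kafka producer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: serializes a method's named arguments into a flat JSON object.
final class MethodContextJson {
    static String toJson(Map<String, Object> args) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> entry : args.entrySet()) {
            json.add(quote(entry.getKey()) + ":" + render(entry.getValue()));
        }
        return json.toString();
    }

    // Numbers, booleans and null are emitted bare; everything else is emitted as a string.
    private static String render(Object value) {
        if (value == null || value instanceof Number || value instanceof Boolean) {
            return String.valueOf(value);
        }
        return quote(value.toString());
    }

    // Minimal escaping (backslash and double quote only); enough for this illustration.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] unused) {
        Map<String, Object> context = new LinkedHashMap<>(); // keeps argument order
        context.put("fullName", "Asaf Halili");
        context.put("mathGrade", 85);
        System.out.println(toJson(context)); // {"fullName":"Asaf Halili","mathGrade":85}
    }
}
```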
39. A DLQ Topic
● DLQ is a Dead Letter Queue
● Its purpose is to save messages (in our case, method contexts) that can’t be handled automatically
● The messages in this topic will be analyzed manually.
(Diagram: Producer, DLQ Topic, Consumer crossed out)
42. High Level Architecture
(Diagram: the Students Service and its saveStudent method, with a Retry 1 producer and consumer, a Retry 2 producer and consumer, and a DLQ producer, connected through the Retry1, Retry2, and DLQ Kafka topics.)
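The flow in the diagram can be sketched as a small routing rule: each additional failure moves the method context one topic further, and once the retry topics are exhausted it lands in the DLQ. The topic names come from the slide. The class `RetryRouter` and its `topicFor` method are hypothetical, and the sketch leaves out the actual Kafka producers and consumers.

```java
import java.util.List;

// Hypothetical router: picks the Kafka topic a failed call should be published to.
final class RetryRouter {
    // Ordered retry topics, as named on the architecture slide.
    private static final List<String> RETRY_TOPICS = List.of("Retry1", "Retry2");
    private static final String DLQ_TOPIC = "DLQ";

    // failedAttempts is how many times the call has failed so far (1 = the original call failed).
    static String topicFor(int failedAttempts) {
        if (failedAttempts < 1) {
            throw new IllegalArgumentException("failedAttempts must be >= 1");
        }
        return failedAttempts <= RETRY_TOPICS.size()
                ? RETRY_TOPICS.get(failedAttempts - 1)
                : DLQ_TOPIC; // retries exhausted: park the message for manual analysis
    }
}
```

With two retry topics, the first failure maps to Retry1, the second to Retry2, and any later failure to DLQ.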
46. The Solution Capabilities
Smart
● Multiple retries with exponential backoff to allow failures enough time to repair
● DLQ topic for manual analysis
Persistent
● We use Kafka as our persistence layer
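For readers unfamiliar with exponential backoff, here is a minimal sketch of the delay calculation: the wait doubles with every retry and is capped at a maximum. The class `Backoff`, the doubling factor, and the cap are assumptions for illustration. The slides do not specify the formula or values used in the actual solution.

```java
import java.time.Duration;

// Hypothetical helper: computes the wait before the n-th retry as base * 2^(n-1), capped at max.
final class Backoff {
    static Duration delayFor(int attempt, Duration base, Duration max) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Bound the exponent so the factor stays a sane positive long for large attempt numbers.
        long factor = 1L << Math.min(attempt - 1, 30);
        Duration delay = base.multipliedBy(factor);
        return delay.compareTo(max) > 0 ? max : delay;
    }
}
```

With a base of 1 second and a cap of 60 seconds, retries 1, 2 and 3 wait 1, 2 and 4 seconds, and the tenth retry waits the capped 60 seconds.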
52. The Solution Capabilities
Pluggable
● Implemented as a method annotation
@OribiKafkaRetrying
public void methodToRetryOnFailure(String name) {
    System.out.println("Hello " + name);
}
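The slide shows only how the annotation is used, so the sketch below fills in one plausible shape for the rest: a runtime-retained marker annotation and a reflective scan that finds the methods that opted in. The annotation's declaration here, along with the `StudentGreeter` and `RetryableMethods` classes, is an assumption and not the actual Oribi implementation, which would also need to intercept the call and publish its context to Kafka on failure.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Assumed declaration: a marker annotation that must be visible at runtime to be discovered.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface OribiKafkaRetrying {}

// Hypothetical service class using the annotation as on the slide.
class StudentGreeter {
    @OribiKafkaRetrying
    public void methodToRetryOnFailure(String name) {
        System.out.println("Hello " + name);
    }

    public void notRetried() {
        // Not annotated, so the retry mechanism ignores it.
    }
}

// Hypothetical scanner: lists the methods of a class that opted in to retrying.
final class RetryableMethods {
    static List<String> namesIn(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method method : type.getDeclaredMethods()) {
            if (method.isAnnotationPresent(OribiKafkaRetrying.class)) {
                names.add(method.getName());
            }
        }
        return names;
    }
}
```

Keeping the opt-in on the method itself is what makes the mechanism pluggable: adding or removing retry behaviour is a one-line change at the call site's definition.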
53. Helpful Resources
● Uber’s Reliable Reprocessing -
https://eng.uber.com/reliable-reprocessing/
The solution I described is based on Uber’s article.
● Handle Failures In A Complex Microservices Architecture -
https://techblog.oribi.io/tech-blog/handle-failures-in-complex-microservices-architecture
54. Thanks!
Feel free to contact me :-)
asaf.halili@gmail.com
linkedin.com/in/asafhalili
@asafhalili