Today almost all systems are distributed and interact with one another in complex ways to provide useful functionality. In a software system, resilience is the ability to recover to a working condition after being affected by a serious incident. Ballerina has built-in functionality for making programs resilient to network failures. This slide deck explores how to build resilience patterns with Ballerina.
6. But when he started to implement the application,
Bob found issues with some of the legacy services.
Transient Network Failures
Moderate Load
Intermittent Failures
7. Bob found it very
difficult to build a
reliable application.
10. The ability to return to the
original form or position after
being affected by a particular
alteration
What is Resilience?
11. In a software system, resilience means ...
… the ability to recover to a working condition after being
affected by a serious incident
Resilience in Software Applications
12. “The probability of failure-free software operation for a
specified period of time in a specified environment.”
- The IEEE Reliability Society
• 100% operational all the time
Reliability and Resilience
Reliability
http://www.picserver.org/r/reliability.html
14. Distributed and complex systems
with many interactions are prone
to failures
Why Focusing on Reliability is Not
Enough
Systems are Complex and Prone to Failures
15. • Untested corner cases
• Minor mistakes can cause serious production
incidents
• Failures are unpredictable
Why Focusing on Reliability is Not
Enough
Avoiding Failures is Not Practical
16. • Handle unexpected situations
• When one feature is temporarily unavailable, the rest
of the application still runs
• Stop errors in downstream components of a complex
system from propagating upstream
Resilience in Production
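The second and third points above can be illustrated with a minimal fallback wrapper. This is a language-neutral sketch in Python, not Ballerina's built-in functionality; the names call_with_fallback, fetch_recommendations, and render_home_page are hypothetical and exist only for this example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T) -> T:
    """Return primary() if it succeeds, otherwise the fallback value."""
    try:
        return primary()
    except Exception:
        # Contain the downstream error here so it does not propagate upstream.
        return fallback


def fetch_recommendations() -> list:
    # Hypothetical optional feature that is temporarily unavailable.
    raise ConnectionError("recommendation service unavailable")


def render_home_page() -> dict:
    return {
        "account": "ok",  # core feature, still served
        "recommendations": call_with_fallback(fetch_recommendations, []),
    }
```

With this shape, the page is still rendered when the optional feature fails; the user only sees an empty recommendations list.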
17. It’s All About Achieving
Availability of a Production
System!
18. Best case:
• The user gets 100% availability of the
service
Typical case:
• The user sees a graceful degradation of the
service
What Does it Mean to a User?
19. • Never expect systems to be 100% reliable
• Design systems thinking about connection issues,
down times, etc.
What Does it Mean to a Developer?
21. Isolate components of an application into multiple pools.
If one component fails, the others will continue to serve requests
Bulkhead
Isolation
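One common way to build such pools is to give each component its own fixed number of concurrency slots. The sketch below is in Python and assumes a semaphore-based design; the Bulkhead class and BulkheadFullError are hypothetical names for illustration and are not Ballerina's API.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a pool has no free slot for another call."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        # Each component gets its own fixed number of slots.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a saturated
        # component cannot tie up callers of other components.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this pool")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

A caller would create one Bulkhead per downstream component. When one pool is exhausted, calls to that component fail fast while calls through the other pools continue to be served.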
22. • Transient failures are not uncommon
• They recover by themselves
• Can be handled by
– Cancel
– Retry
– Retry with a delay
Retry
https://www.flickr.com/photos/markgregory/8184890333
https://creativecommons.org/licenses/by-nc-sa/2.0/legalcode
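The three options on this slide can be sketched as one helper. This is an illustrative Python version, not Ballerina's retry support; the function name retry and its parameters (attempts, delay, sleep) are assumptions made for the example.

```python
import time


def retry(func, attempts=3, delay=0.0, sleep=time.sleep):
    """Call func until it succeeds or the attempts are used up.

    attempts=1          -> cancel on the first failure
    delay=0             -> retry immediately
    delay>0             -> retry with a fixed delay between attempts
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # give up and surface the last error
            if delay > 0:
                sleep(delay)
```

The sleep parameter is injectable only so that the helper can be exercised without real waiting. A production version would usually retry only on errors known to be transient rather than on every exception.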
23. • Hide downstream latency and keep
upstream callers responsive
• Prevent waiting forever
Timeout
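A timeout can be sketched by running the downstream call on a worker thread and waiting for its result only up to a limit. The Python example below uses the standard concurrent.futures module; call_with_timeout and its fallback parameter are hypothetical names, and this is not how Ballerina itself configures timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for downstream calls (the size is an arbitrary choice).
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback=None):
    """Wait at most timeout_s seconds for func(), else return fallback."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a task that is already running is not interrupted.
        future.cancel()
        return fallback


def slow_service():
    # Hypothetical downstream call with high latency.
    time.sleep(0.3)
    return "slow"
```

Note the limitation in the comment: the caller stops waiting, but the slow call keeps occupying a worker until it finishes, which is one reason timeouts are often combined with a bulkhead.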
24. • Some transient failures take
much longer to recover
• Repeatedly retrying may
hinder recoverability
• Retry up to a certain degree
and cut off
Circuit Breaker
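25. Circuit Breaker
States in circuit breaker: Closed, Open, Half-Open
• Closed → Closed: Success
• Closed → Closed: Fail (threshold not reached)
• Closed → Open: Fail (threshold exceeded)
• Open → Open: Fail/Keep Open
• Open → Half-Open: Reset Timeout
• Half-Open → Closed: Success
• Half-Open → Open: Fail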
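The closed, open, and half-open states can be sketched as a small state machine. The Python class below is a minimal, single-threaded illustration; the names CircuitBreaker, failure_threshold, reset_timeout, and clock are assumptions for this example and do not reflect Ballerina's circuit breaker configuration. It opens when consecutive failures reach the threshold, which is one possible reading of "threshold exceeded".

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Reset timeout elapsed: let one trial call through.
            self.state = "HALF_OPEN"
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

While the circuit is open, callers fail fast with CircuitOpenError instead of repeatedly hitting a service that is still recovering. The clock parameter is injectable so the reset timeout can be exercised without waiting.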
41. Offering Different Quality of Services
Bulkhead
Banking Service
Reliable Service
Service with Transient Failures
Standard Customer
Priority Customer
42. Offering Different Quality of Services
Bulkhead
Reliable Service
Service with Transient Issues
Priority Check Function
43. Conclusion
● What is Resilience
● Reliability and resilience
● Resilience patterns
● Building resilience patterns with Ballerina