Building add-ons for Atlassian products today means building a Connect add-on and running it as a service (or, more commonly, as a set of microservices) on your own infrastructure or a PaaS provider’s. While this has many benefits, the transition from monolithic to distributed systems brings additional failure modes that simply do not occur in the world of local function calls. Join Atlassian developer Diego Berrueta for a walk-through of five resilience techniques that will help keep your services rock-solid in the face of unreliable, slow, or faulty systems.
Diego Berrueta, Engineering Principal, Atlassian
9. Fault-tolerant system
Continues to operate in the event of faults in some
of its components
Fault
Deviation from the normal state, either due to
internal or external causes
Failure
Observable impact of a fault in a system, usually
manifests as reduced availability
12. Bob Yeats (OSU); Flickr (www.flickr.com/photos/oregonstateuniversity/), CC-by
Faults happen
Preventing them may be technically or
economically impractical
13. Systems can be designed and
built to tolerate faults
FAULT TOLERANCE
20. Asynchronous communication
Sender and receiver fail independently,
receiver can catch up later
Synchronous communication
Propagates failures, context may be lost
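A minimal sketch of the asynchronous side of this slide, assuming an in-process `queue.Queue` as a stand-in for a durable message broker. The names `outbox`, `send` and `drain` are hypothetical and exist only for illustration. The sender returns as soon as the event is queued, and the receiver processes the backlog whenever it is available again.

```python
import queue

# Stand-in for a durable broker; a real deployment would need persistence.
outbox = queue.Queue()


def send(event):
    """Enqueue and return immediately; the receiver does not have to be up."""
    outbox.put(event)


def drain(handler):
    """Receiver side: catch up on everything queued so far."""
    handled = 0
    while True:
        try:
            event = outbox.get_nowait()
        except queue.Empty:
            return handled
        handler(event)
        handled += 1
```

For example, calling `send` three times while no receiver is running and then calling `drain(print)` once would process the three events in the order they were sent.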
23. Avoid SPOF
Find the components which
compromise the system
How to contain faults
Invest in redundancy
Improve availability by having
more than one of everything
Build bulkheads
Set up logic walls to
reduce the blast radius
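One possible way to build a bulkhead in application code is to cap the number of concurrent calls to each dependency, so that a slow dependency cannot consume every worker. The sketch below assumes a thread-based service; `Bulkhead` and `BulkheadFullError` are hypothetical names, not part of any library.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    """Caps concurrent calls to one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Do not queue behind a struggling dependency: reject at once.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance keeps a fault in one of them from spreading to calls that do not depend on it.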
35. Decline service
When overloaded, ask clients
to come back later
How to fail fast
Never wait long
Set a timeout for blocking
calls and slow operations
Validate early
Avoid starting something that
cannot be completed
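The three fail-fast rules on this slide can be sketched in a single request handler. This is an illustration under stated assumptions: `handle`, the `issue_key` field, the `in_flight` counter and the limits are all hypothetical, and the status codes simply mirror common HTTP usage (400, 503, 504).

```python
import concurrent.futures

# Shared pool for calls that must be time-bounded.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def handle(request, in_flight, operation, timeout_s=0.5, max_in_flight=100):
    """Return a (status, body) pair, failing as early as possible."""
    # Validate early: do not start work that cannot be completed.
    if "issue_key" not in request:
        return 400, "missing issue_key"

    # Decline service: when overloaded, ask the client to come back later.
    if in_flight >= max_in_flight:
        return 503, "overloaded, retry later"

    # Never wait long: bound the time spent on the blocking call.
    future = _executor.submit(operation, request["issue_key"])
    try:
        return 200, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return 504, "dependency timed out"
```

Note that the timeout only frees the caller. The worker thread still runs to completion, so the underlying client (HTTP, database) should have its own timeout as well.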
39. Fault-tolerance libraries
Circuit breaking
Avoid cascading failures during
periods of turbulence
Monitoring and alerting
Observe the behaviour of all your
dependencies
Timeouts
Time-bound any operation
Fall-back
Recover using an alternative path
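To make the circuit-breaking idea concrete, here is a minimal, single-threaded sketch. It is not the API of any particular fault-tolerance library: `CircuitBreaker`, `CircuitOpenError`, the thresholds and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail immediately instead of hitting the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Otherwise half-open: let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            tripped = self._failures >= self.failure_threshold
            if self._opened_at is not None or tripped:
                # Trip, or re-open after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and take the fall-back path. A production version would also need locking for concurrent use and metrics for monitoring.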
42. Anticipate failure
If it is not going to work,
do not even try
How to escape
Degrade gracefully
A cached result or a default
value may be an alternative
Detect problems
Compare all interactions
against error thresholds
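Graceful degradation can be as simple as remembering the last good answer. The sketch below assumes an in-memory dictionary as the cache and a hypothetical helper named `get_with_fallback`. It returns the value together with a flag, so the caller knows whether it received a fresh or a degraded result.

```python
# Last known good values, keyed by request key (in-memory for illustration).
_last_good = {}


def get_with_fallback(key, fetch, default=None):
    """Return (value, degraded). Falls back to a cached or default value."""
    try:
        value = fetch(key)
    except Exception:
        # Degrade: serve a possibly stale cached value, else the default.
        return _last_good.get(key, default), True
    _last_good[key] = value
    return value, False
```

Whether a stale value is acceptable depends on the data: it may be fine for a user's display name and wrong for a permission check.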
51. Insist smartly
Transient errors can be
retried with back-off
How to adjust
Report availability
Apply back pressure to
prevent congestion
Negotiate size
Limit the cost of the job
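Retrying with back-off might look like the following sketch, which uses exponential delays with random jitter. The names `TransientError` and `retry_with_backoff`, the default delays, and the injectable `sleep` and `rng` parameters are assumptions for illustration. Only errors the caller has classified as transient are retried.

```python
import random
import time


class TransientError(Exception):
    """An error that is expected to go away on its own."""


def retry_with_backoff(func, max_attempts=4, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep, rng=random.random):
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the error
            # Ceiling doubles with each attempt, up to max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(rng() * ceiling)
```

Capping both the number of attempts and the delay keeps retries from turning into extra load on a dependency that is already struggling.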
59. Monitor and alert
Understand the behaviour of
the system in production
How to learn
Reflect on incidents
Analyse the root cause and
prevent recurrences
Test what if…?
Deliberately introduce chaos
to assess fault-tolerance
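A small-scale version of the "what if…?" test is to wrap a dependency so that it fails some fraction of the time, then check that the caller copes. The wrapper below is a hypothetical sketch (`with_chaos` and `InjectedFault` are invented names) and is meant for test or staging environments, not as a substitute for dedicated chaos tooling.

```python
import random


class InjectedFault(Exception):
    """A failure introduced on purpose."""


def with_chaos(func, failure_rate, rng=random.random):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault("injected by chaos wrapper")
        return func(*args, **kwargs)
    return wrapper
```

Running an existing test suite against `with_chaos(client_call, 0.2)` is one way to find callers that assume a dependency never fails.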
61. Hope is not a strategy
Test, observe, reflect
Life starts after releasing
“Code complete” is not “production ready”
Be cynical
Do not trust anybody
Robustness
is an attitude
62. Some disasters
can be prevented
Build and test with failure in mind
Faults are unavoidable
Any possible fault
will eventually happen