Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
The 7 quests of resilient software design
A guide for the adventurous software engineer

Uwe Friedrichsen – codecentric AG...
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.sl...
You want to do resilient software design ...
... and you expect everything to be like this
But somehow it feels more like that ...
... or even that
What the **** went wrong?
The road to resilience is a twisted one
“7 quests you must complete!”
Quest #1
Understand the business case
“How much money will we earn with it?”
“Does it improve our velocity?”
Resilience is not about making money
Resilience is about not losing money
Lack of resilient software design
Reduced system availability
Users cannot do what they intend to do
Less transactions per...
Quest #2
Embrace distributed systems
Everything fails, all the time.
-- Werner Vogels
If X then Y
What we learned in our IT education
If X then maybe Y



This changes
everything!
What we need for distributed...
Yet, we usually use deterministic thinking
to reason about distributed systems
Failures in distributed systems ...

•  Crash failure
•  Omission failure
•  Timing failure
•  Response failure
•  Byzanti...
... turn seemingly simple issues into very hard ones
Time & Ordering

Leslie Lamport

"Time, clocks, and the
ordering of e...
Embrace distributed systems

•  Distributed systems introduce non-determinism regarding
•  Execution completeness
•  Messa...
But do I really need to care?
(The system, I am working on, is not a distributed system)
(Almost) every system is a distributed system
-- Chas Emerick


http://www.infoq.com/presentations/problems-distributed-sy...
… and it’s getting “worse”


•  Cloud-based systems
•  Microservices
•  Zero Downtime
•  Mobile & IoT
•  Social Web
Quest #3
Avoid the “100% available” trap
The “100% available” trap, version #1

You: “How should the application respond if a technical failure occurs?”

Business ...
The “100% available” trap, version #2

You: “How do you handle the situation if the service you call does not

 
respond (...
The question is not, if a failure will happen
The question is, when a failure will happen
A short note about availability

Assume a service availability of 99,5% (incl. planned downtimes)

•  10 services involved...
Quest #4
Establish the ops-dev feedback loop
The big wall between Dev and Ops
In a distributed environment, you cannot solve
availability issues on an infrastructure level only
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous impro...
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous impro...
For effective resilient software design
you need a working ops-dev feedback loop
Establishing the feedback loop


•  Adopt DevOps
•  Adopt Site Reliability Engineering (SRE)
•  Or do it your own way if y...
Quest #5
Master functional design
Without proper functional design
nothing else matters
Isolation

•  System must not fail as a whole
•  Split system in parts and isolate parts against each other
•  Avoid casca...
Bulkhead

•  Bulkheads implement the “parts” that need to be isolated
•  Core isolation pattern (a.k.a. “failure units” or...
Hmm, sound easy. Why should that be hard?
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer ...
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. av...
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
r...
Let us apply our well-known best practices

•  Divide & conquer a.k.a. functional decomposition
•  DRY (Don’t Repeat Yours...
Unfortunately ...
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer ...
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. av...
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
r...
Welcome to distributed hell!
Caches to the rescue!
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer ...
Caches to the rescue?
Do you really think
that copying stale data all over your system
is a suitable measure
to fix an inherently broken design? ...
We have to re-learn design
for distributed system
No silver bullet
Yet, a few guiding thoughts ...
Foundations of design

•  “High cohesion, low coupling” & “separation of concerns”
•  “Crucial across process boundaries
•...
Short activation paths

•  Long activation paths affect availability
•  Increase likelihood of failures
•  Minimize remote...
Be (extremely) wary of reusability

•  Reusability increases coupling
•  Reusability usually leads to bad service design
•...
Quest #6
Know your toolbox
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Ze...
Using resilience patterns

•  Patterns are options, not obligations
•  Don’t pick too many patterns
•  Each pattern increa...
How other people did it
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Escalation
Communication
paradigm
Isolation
Sy...
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Fallback
Share load
Bounded
queue
Communicatio...
Quest #7
Preserve the collective memory
We face a new generation of developers
every 5 years
We loose our collective memory
every 5 years *


* Mean time until a topic discussion in the community starts over form sc...
Time working in IT
Growth of
knowledge
Depth of
insights
What do we do to compensate this effect?
We look for the new & shiny stuff ...
... as anything not new must be useless crap!
We need to rediscover our insights
every 5 years
In IT, we suffer from
continuous collective amnesia
and we are even proud of it
How can we become better?
Wrap-up
The 7 quests at a glance
Wrap-up

•  The road to resilient software design is a twisted one!
•  Most challenges are only indirectly related to RSD
...
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.sl...
Nächste SlideShare
Wird geladen in …5
×

The 7 quests of resilient software design

3.068 Aufrufe

Veröffentlicht am

This is basically a "lessons learned" talk. While dealing with resilient software design for several years meanwhile, I realized along the way that implementing a specific pattern like timeout detection, circuit breaker, back-pressure, etc. is the smallest of the challenges.

As so often in software development, the actual pitfalls that keep you from being successful with your project - here, creating a robust application - are not to be found in the area of creating code. Based on my experiences, the actual pitfalls are to be found in areas that are at best loosely related to resilient software design.

In this talk, I discuss some of those pitfalls that I have experienced more than once along my way. It starts with not understanding the goals of resilient software design, continues from a lack of understanding the characteristics of distributed system, over missing required feedback loops and deficiencies in functional design, to not understanding the trade-offs of applying resilience patterns, and ends with the problem of our continuous collective insight loss.

The main objective of the talk is to sensitize for the pitfalls. Wherever possible I also added some suggestions how to deal with the topics. Unfortunately, some topics neither have an obvious nor a simple solutions - at least none that I would know about ...

As always the voice track is missing and thus a huge part of the content of the talk. Yet, I hope the slides in themselves are of some use for you and offer some helpful ideas and pointers.

Veröffentlicht in: Technologie
  • Als Erste(r) kommentieren

The 7 quests of resilient software design

  1. 1. The 7 quests of resilient software design A guide for the adventurous software engineer Uwe Friedrichsen – codecentric AG – 2012-2017
  2. 2. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried
  3. 3. You want to do resilient software design ...
  4. 4. ... and you expect everything to be like this
  5. 5. But somehow it feels more like that ...
  6. 6. ... or even that
  7. 7. What the **** went wrong?
  8. 8. The road to resilience is a twisted one
  9. 9. “7 quests you must complete!”
  10. 10. Quest #1
  11. 11. Understand the business case
  12. 12. “How much money will we earn with it?”
  13. 13. “Does it improve our velocity?”
  14. 14. Resilience is not about making money Resilience is about not losing money
  15. 15. Lack of resilient software design Reduced system availability Users cannot do what they intend to do Less transactions per time period Immediate lost revenue Users get annoyed Churn rate increases Delayed lost revenue Due to non-determinism of distributed systems This is at most your resilience budget
  16. 16. Quest #2
  17. 17. Embrace distributed systems
  18. 18. Everything fails, all the time. -- Werner Vogels
  19. 19. If X then Y What we learned in our IT education If X then maybe Y This changes everything! What we need for distributed systems We are good at this (due to how our brains work) Inside process thinking Reasoning about deterministic behavior Designing a complicated system We are poor at that (due to how our brains work) Reasoning about non-deterministic behavior Across process thinking Designing a complex system
  20. 20. Yet, we usually use deterministic thinking to reason about distributed systems
  21. 21. Failures in distributed systems ... •  Crash failure •  Omission failure •  Timing failure •  Response failure •  Byzantine failure
  22. 22. ... turn seemingly simple issues into very hard ones Time & Ordering Leslie Lamport "Time, clocks, and the ordering of events in distributed systems" Consensus Leslie Lamport ”The part-time parliament” (Paxos) CAP Eric A. Brewer "Towards robust distributed systems" Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease "The Byzantine generals problem" Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson "Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch ”A hundred impossibility proofs for distributed computing"
  23. 23. Embrace distributed systems •  Distributed systems introduce non-determinism regarding •  Execution completeness •  Message ordering •  Communication timing •  You will be affected by this at the application level •  Don’t expect your infrastructure to hide all effects from you •  Better have a plan to detect and recover from inconsistencies
  24. 24. But do I really need to care? (The system, I am working on, is not a distributed system)
  25. 25. (Almost) every system is a distributed system -- Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems
  26. 26. … and it’s getting “worse” •  Cloud-based systems •  Microservices •  Zero Downtime •  Mobile & IoT •  Social Web
  27. 27. Quest #3
  28. 28. Avoid the “100% available” trap
  29. 29. The “100% available” trap, version #1 You: “How should the application respond if a technical failure occurs?” Business owner: “This must not happen! It is your responsibility to make sure that this will not happen.”
  30. 30. The “100% available” trap, version #2 You: “How do you handle the situation if the service you call does not respond (or does not respond timely)?” Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.” Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
  31. 31. The question is not, if a failure will happen The question is, when a failure will happen
  32. 32. A short note about availability Assume a service availability of 99,5% (incl. planned downtimes) •  10 services involved in a request à 95,1% probability of success •  50 services involved in a request à 77,8% probability of success
  33. 33. Quest #4
  34. 34. Establish the ops-dev feedback loop
  35. 35. The big wall between Dev and Ops
  36. 36. In a distributed environment, you cannot solve availability issues on an infrastructure level only
  37. 37. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect
  38. 38. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect All developer activities towards improving robustness are basically “shooting at the dark” which is neither effective not sustainable Having a wall between Dev and Ops breaks the cycle required to implement effective robustness measures
  39. 39. For effective resilient software design you need a working ops-dev feedback loop
  40. 40. Establishing the feedback loop •  Adopt DevOps •  Adopt Site Reliability Engineering (SRE) •  Or do it your own way if you know a better way ... •  ... but make sure you establish the required feedback loops!
  41. 41. Quest #5
  42. 42. Master functional design
  43. 43. Without proper functional design nothing else matters
  44. 44. Isolation •  System must not fail as a whole •  Split system in parts and isolate parts against each other •  Avoid cascading failures •  Foundation of resilient software design
  45. 45. Bulkhead •  Bulkheads implement the “parts” that need to be isolated •  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”) •  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ... •  Shaping good bulkheads is a pure functional design issue (and extremely hard)
  46. 46. Hmm, sound easy. Why should that be hard?
  47. 47. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design How do we avoid this …
  48. 48. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  49. 49. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... without ending up with this?
  50. 50. Let us apply our well-known best practices •  Divide & conquer a.k.a. functional decomposition •  DRY (Don’t Repeat Yourself) •  Design for reusability •  Layered architecture •  …
  51. 51. Unfortunately ...
  52. 52. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design ... this usually leads to this …
  53. 53. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  54. 54. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... and in the end also often to this.
  55. 55. Welcome to distributed hell!
  56. 56. Caches to the rescue!
  57. 57. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design CacheofB Break tight service coupling by caching data/responses of downstream service
  58. 58. Caches to the rescue?
  59. 59. Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design? * * Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
  60. 60. We have to re-learn design for distributed system
  61. 61. No silver bullet
  62. 62. Yet, a few guiding thoughts ...
  63. 63. Foundations of design •  “High cohesion, low coupling” & “separation of concerns” •  “Crucial across process boundaries •  Still poorly understood issue •  Start with •  Understanding organizational boundaries •  Understanding use cases and flows •  Identifying functional domains (à DDD) •  Finding areas that change independently •  Do not start with a data model!
  64. 64. Short activation paths •  Long activation paths affect availability •  Increase likelihood of failures •  Minimize remote calls per request •  Need to balance opposing forces •  Avoid monolith à clear separation of concerns •  Minimize requests à cluster functionality & data •  Caches can sometimes help, but stale data as trade-off
  65. 65. Be (extremely) wary of reusability •  Reusability increases coupling •  Reusability usually leads to bad service design •  Reusability compromises availability •  Reusability rarely pays •  Do not strive for reusable services •  Strive for replaceable services instead •  Try to tackle reusability issues with libraries
  66. 66. Quest #6
  67. 67. Know your toolbox
  68. 68. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Bulkhead System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
  69. 69. Using resilience patterns •  Patterns are options, not obligations •  Don’t pick too many patterns •  Each pattern increases complexity •  Complexity is the enemy of robustness •  Each pattern costs money in dev & ops •  You only have a limited resilience budget •  Look for complementary patterns
  70. 70. How other people did it
  71. 71. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Escalation Communication paradigm Isolation System level Monitor Heartbeat Either level Hot deployments Restart Let it crash! Node level Actor Messaging Erlang (Akka) Core patterns
  72. 72. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Fallback Share load Bounded queue Communication paradigm Isolation System level Monitor Either level Error injection Retry Limit retries Node level Circuit breaker Timeout Zero downtime deployment Canary releases Redundancy Several variants (Micro)service Request/ response Netflix Core patterns
  73. 73. Quest #7
  74. 74. Preserve the collective memory
  75. 75. We face a new generation of developers every 5 years
  76. 76. We loose our collective memory every 5 years * * Mean time until a topic discussion in the community starts over form scratch
  77. 77. Time working in IT Growth of knowledge Depth of insights
  78. 78. What do we do to compensate this effect?
  79. 79. We look for the new & shiny stuff ...
  80. 80. ... as anything not new must be useless crap!
  81. 81. We need to rediscover our insights every 5 years
  82. 82. In IT, we suffer from continuous collective amnesia and we are even proud of it
  83. 83. How can we become better?
  84. 84. Wrap-up
  85. 85. The 7 quests at a glance
  86. 86. Wrap-up •  The road to resilient software design is a twisted one! •  Most challenges are only indirectly related to RSD •  Most challenges are not coding related •  Mastering functional design is extremely hard ... •  ... while learning the patterns is relatively easy •  How do we preserve our collective memory?
  87. 87. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried

×