Confidential · For Company Name · Version 1.0
Latency Control & Supervision in
Resilience Design Patterns
Tu Pham - CTO @ Eway
TOC
Terminology
Why It Is So IMPORTANT
Why It Is So HARD
Design Patterns
Anti-Patterns
Q & A
Terminology
Distributed Systems
Networked components that communicate with each other by passing
messages, most often to achieve a common goal.
Resiliency
The capacity of a system to recover from difficulties.
Availability
The probability that a system is operating at time `t`.
Reliability
The degree to which a system or component performs specified functions
under specified conditions for a specified period of time.
Faults
A fault is an incorrect internal state in your
system. Examples:
1. A storage layer that slows down
2. Memory leaks in the application
3. Blocked threads
4. Dependency failures
5. Bad data propagating through the system (most
often because input data is not validated
enough)
Failure
A failure is the inability of the system to perform
its intended job.
Failure means loss of uptime and availability
of the system. Faults, if not contained and kept from
propagating, can lead to failures.
Why It Is So IMPORTANT
1. Losing customers and partners to
competitors, which means financial losses for the
company
2. Affecting the livelihood of publishers and
advertisers
3. Affecting the salary and bonus of OUR TEAM
:))
4. Affecting services for customers and
colleagues
Why It Is So HARD
Building resiliency in a complex
microservices architecture, with
multiple distributed systems
communicating with each other, is
difficult.
Some of the things that make it
hard are:
1. The network is unreliable
2. Dependencies can always fail
3. User behavior is unpredictable
Patterns
Latency Control
● Complements isolation
● Detection and handling of non-timely
responses
● Avoid cascading temporal failures
● Different approaches and patterns available
Timeout
● Preserve responsiveness
independent of downstream latency
● Measure response time of
downstream calls
● Stop waiting after a pre-determined
timeout
● Take alternate action if timeout was
reached
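The timeout bullets above can be sketched as follows. This is a minimal illustration using only the Python standard library; `slow_downstream` and `call_with_timeout` are hypothetical names chosen for the example, not part of the original deck.

```python
import concurrent.futures
import time


def slow_downstream(delay_s: float) -> str:
    """Hypothetical downstream call that takes delay_s seconds to answer."""
    time.sleep(delay_s)
    return "fresh value"


def call_with_timeout(fn, timeout_s: float, fallback, *args):
    """Run fn in a worker thread and stop waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Alternate action: the caller stays responsive and uses a fallback.
        return fallback
    finally:
        # Do not block on the straggler; the worker finishes on its own.
        pool.shutdown(wait=False)
```

Note that the slow call keeps running in the background after the timeout; the pattern only protects the caller's responsiveness, it does not cancel the downstream work.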
Fail Fast
● “If you know you’re going to fail, you
better fail fast”
● Avoid foreseeable failures
● Usually implemented by adding
checks in front of costly actions
● Enhances probability of not failing
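As a rough sketch of "checks in front of costly actions", the example below validates cheap preconditions before doing any expensive work. `generate_report`, its parameters, and `FailFastError` are assumptions made up for illustration.

```python
class FailFastError(Exception):
    """Raised before any costly work starts."""


def generate_report(user_id, rows, db_available):
    """Hypothetical costly action guarded by cheap up-front checks."""
    # Cheap checks first: fail immediately if the call cannot succeed.
    if not db_available:
        raise FailFastError("database is unavailable")
    if not isinstance(user_id, int) or user_id <= 0:
        raise FailFastError("invalid user_id")
    if not rows:
        raise FailFastError("nothing to report")
    # Costly action, only reached when it is likely to succeed.
    return {"user_id": user_id, "total": sum(rows)}
```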
Circuit Breaker
● Probably the most often cited resilience
pattern
● Extension of the timeout pattern
● Takes downstream unit offline if
calls fail multiple times
● Specific variant of the fail fast
pattern
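A minimal circuit breaker sketch, assuming a simple consecutive-failure counter and a fixed reset interval (both are illustrative choices; libraries such as those in the references offer richer policies). The class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of calling the downstream unit.
                raise CircuitOpenError("downstream unit is offline")
            # Half-open: the reset interval passed, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open (or re-open after a failed trial call).
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The injectable `clock` is there only so the behavior can be exercised without real waiting.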
Fan Out & Quickest Reply
● Send request to multiple workers
● Use quickest reply and discard all
other responses
● Reduces the probability of slow
responses
● The tradeoff is WASTED resources
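One way to sketch fan out with quickest reply is with threads from the standard library. `worker` and `quickest_reply` are hypothetical names, and the `cancel_futures` argument assumes Python 3.9 or newer.

```python
import concurrent.futures
import time


def worker(name, delay_s):
    """Hypothetical worker that answers after delay_s seconds."""
    time.sleep(delay_s)
    return name


def quickest_reply(calls):
    """calls: list of (fn, args) tuples. Returns the first successful result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(calls))
    futures = [pool.submit(fn, *args) for fn, args in calls]
    try:
        for future in concurrent.futures.as_completed(futures):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all workers failed")
    finally:
        # Discard the rest: cancel what has not started, do not wait for others.
        pool.shutdown(wait=False, cancel_futures=True)
```

The slower workers still run to completion in the background, which is exactly the wasted work the last bullet warns about.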
Bounded Queues
● Limit request queue sizes in front of
highly utilized resources
● Avoids latency due to overloaded
resources
● Introduces pushback on the callers
● Another variant of the fail fast
pattern
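A bounded queue can be sketched with the standard `queue.Queue` and its `maxsize` argument. The `submit` helper is a hypothetical wrapper: it returns `False` when the queue is full, which is the pushback signal to the caller.

```python
import queue


def submit(requests: queue.Queue, request) -> bool:
    """Try to enqueue a request; return False (pushback) if the queue is full."""
    try:
        requests.put_nowait(request)
        return True
    except queue.Full:
        # Fail fast instead of letting latency grow behind a long queue.
        return False


# Illustrative limit of two pending requests in front of a busy resource.
pending = queue.Queue(maxsize=2)
```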
Supervision
● Provides failure handling beyond the means of
a single failure unit
● Detect unit failures
● Provide means for error escalation
● Different approaches and patterns available
Shed Load
● Upstream isolation pattern
● Avoid becoming overloaded due to
too many requests
● Install a gatekeeper in front of the
resource
● Shed requests based on resource
load
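The gatekeeper idea might look like the sketch below, which uses the number of in-flight requests as a stand-in for "resource load". That load measure, the `Gatekeeper` class, and the `("shed", None)` return value are all assumptions for illustration.

```python
import threading


class Gatekeeper:
    """Sheds requests once too many are already being processed."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def handle(self, fn, *args):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                # Shed: reject immediately instead of overloading the resource.
                return ("shed", None)
            self.in_flight += 1
        try:
            return ("ok", fn(*args))
        finally:
            with self.lock:
                self.in_flight -= 1
```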
Monitor
● Observe unit behavior and
interactions from the outside
● Automatically respond to detected
failures
● Part of the system – complex failure
handling strategies possible
● Outside the system – more robust
against system level failures
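As one possible shape for a monitor, the sketch below watches heartbeats from the outside and responds automatically when a unit goes silent. `HeartbeatMonitor`, the heartbeat mechanism, and the `on_failure` callback are illustrative assumptions; the deck does not prescribe a detection method.

```python
import time


class HeartbeatMonitor:
    """Observes units via heartbeats and reacts to units that go silent."""

    def __init__(self, timeout_s, on_failure, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_failure = on_failure  # e.g. restart or alert (hypothetical)
        self.clock = clock
        self.last_seen = {}

    def beat(self, unit):
        """Called whenever a sign of life from the unit is observed."""
        self.last_seen[unit] = self.clock()

    def check(self):
        """Return units silent for longer than timeout_s, after reacting."""
        now = self.clock()
        failed = [u for u, seen in self.last_seen.items()
                  if now - seen > self.timeout_s]
        for unit in failed:
            self.on_failure(unit)
        return failed
```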
Error Handler
● Units often don’t have enough time
or information to handle errors
● Separate business logic and error
handling
● Business logic just focuses on
getting the task done (quickly)
● Error handler has sufficient time
and information to handle errors
Escalation
● Units often don’t have enough time
or information to handle errors
● An escalation peer with more time and
information is needed
● Often multi-level hierarchies
● Pure design issue
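A multi-level escalation hierarchy can be sketched as a chain of supervisors, each handling the errors it knows about and passing everything else to its parent. The `Supervisor` class and the choice of exception types are hypothetical.

```python
class Supervisor:
    """Handles the errors it understands and escalates the rest upward."""

    def __init__(self, name, can_handle, parent=None):
        self.name = name
        self.can_handle = can_handle  # tuple of exception types
        self.parent = parent

    def handle(self, error):
        if isinstance(error, self.can_handle):
            return f"{self.name} handled {type(error).__name__}"
        if self.parent is None:
            # Top of the hierarchy: nobody left to escalate to.
            raise error
        return self.parent.handle(error)


# Two-level hierarchy: the team supervisor escalates to the root supervisor.
root = Supervisor("root", (ConnectionError,))
team = Supervisor("team", (ValueError,), parent=root)
```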
Other Patterns
Fallback
● Units often don’t have enough time
or information to handle errors
● Instead of aborting the computation
because of a missing response, we
fill in a fallback value.
● Of course, it can be DANGEROUS!
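A minimal fallback sketch follows. Because a fallback value can be dangerous, this version also returns a flag so the caller knows the result is degraded; that flag, the `with_fallback` name, and the default error types are assumptions for the example.

```python
def with_fallback(fn, fallback_value, expected_errors=(TimeoutError, ConnectionError)):
    """Return (value, used_fallback). Only expected errors trigger the fallback."""
    try:
        return fn(), False
    except expected_errors:
        # Fill in the fallback instead of aborting the computation.
        return fallback_value, True
```

Unexpected errors are deliberately not swallowed, so a bug does not get masked by a plausible-looking default.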
Retry
● Units have enough time or
information to handle errors
● Send the request again until the
BOUNDARY set by the retry policy is
reached
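A bounded retry can be sketched as below. The policy boundary here is a maximum number of attempts, and the exponential backoff between attempts is an added assumption (the deck only says to stop at the policy boundary). `retry` and its parameters are hypothetical names.

```python
import time


def retry(fn, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call fn until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                # Policy boundary reached: give up and surface the error.
                raise
            # Assumed policy: wait 0.1s, 0.2s, 0.4s, ... between attempts.
            sleep(base_delay_s * 2 ** (attempt - 1))
```

The injectable `sleep` is only there so the delays can be observed without actually waiting.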
Just Don’t
● Infinite delays
● One config / policy for all situations
● Fallback logic without confirmation from
business departments / upper management
● A laggy / buggy monitoring system
References
● https://github.com/Netflix/Hystrix
● https://github.com/alibaba/Sentinel
● https://github.com/resilience4j/resilience4j
● https://github.com/jhalterman/failsafe
“Just Design Our Systems For Failure”
Q&A
