Monitoring Graceful Failure


How can you be sure that your team is alerted to a failure before it causes an outage for your users? The move from monoliths to microservices has allowed pieces of functionality to be deployed individually and on demand. Isolating functionality makes it possible for one microservice to fail without bringing down the whole system. However, releasing and monitoring the API calls made across services has become more complex.

Whether you’re launching a new product or iterating on a feature, delivering a delightful experience is crucial to your success. If something does fail, you’d prefer your users didn’t know. Be thoughtful about how your system will degrade, how to inject failure to verify your design, and how this is monitored.

In this Sensu Summit 2019 talk, Lorne Kligerman, Director of Product at Gremlin, covers failing gracefully as an engineering goal that can be confidently tested and monitored with Chaos Engineering. By purposely causing one service at a time to fail in a controlled environment, you can safely observe and react in a timely manner to limit the effect on the end user.




1. Monitoring Graceful Failure. Lorne Kligerman, Director of Product, Gremlin (@lklig). Aaron Sachs, Customer Reliability Engineer, Sensu (@asachs01).

4. Be down in 10! T-Ho 2017. Hey team… bit of a spill but I’m fine.

5. We Expect Technology To Just Work™

6. Black Friday Failures: Technical Issues Likely Cost Retailers Billions. Macy’s, Lowe’s hit by Black Friday technical glitches. Retail outages online leave shoppers frustrated on Black Friday. People.com.

7. Airline Incidents: Computer Problems Blamed For Flight Delays (4.1.19). Major US Airlines hit by delays after glitch at vendor (4.1.19). Pilots of doomed Boeing 737 MAX fought the plane’s software and lost (4.4.19).

8. Technology is fragile. Plan ahead to keep your users happy. Failure. Graceful degradation.

9. Why Are Failures So Common?

10. Legacy Systems

11. Lack of Testing: Failure, UI, End to end, Integration, Unit.

13. What Can We Do About It?

14. Design For Failure

15. Loading Screens Are Not Graceful

16. Fail on Your Own Terms: Key User Stories & Features. Edge Cases From Unexpected User Behaviour. Dependency Failures.

17. Inject Failure By Breaking Things On Purpose

18. Inject failure one service at a time. Maintain critical functionality.

19. Degrade Gracefully

20. When one dependency fails, users are often affected: Storage, Auth, User Data, Content, Cache, Feature 1, Feature 2.
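One common way to keep a single dependency failure from reaching users is to treat non-critical features as optional when assembling a response. The following Python sketch assumes a page made of one critical piece (content) and several optional features; `render_page`, `fetch_or_none`, and the feature names are hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, Optional


def fetch_or_none(fetch: Callable[[], str]) -> Optional[str]:
    """Call a non-critical dependency; return None instead of raising."""
    try:
        return fetch()
    except Exception:
        return None


def render_page(
    content: Callable[[], str],
    optional: Dict[str, Callable[[], str]],
) -> Dict[str, object]:
    """Build a page, dropping optional features whose dependency is failing."""
    degraded = []
    # Critical dependency: if this raises, the error is allowed to propagate.
    page: Dict[str, object] = {"content": content(), "degraded": degraded}
    for feature, fetch in optional.items():
        value = fetch_or_none(fetch)
        if value is None:
            degraded.append(feature)  # record it so monitoring can see the degradation
        else:
            page[feature] = value
    return page


def broken() -> str:
    """Stand-in for a dependency that is currently down."""
    raise ConnectionError("cache unavailable")


if __name__ == "__main__":
    page = render_page(
        lambda: "article body",
        {"feature_1": lambda: "related items", "feature_2": broken},
    )
    print(page)
```

The `degraded` list is a design choice in this sketch: the user still gets the page, while the service keeps a record of what was left out so the degradation is visible to whoever is watching.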

22. Monitoring + Chaos Engineering

23. Let Monitoring Know
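The slide text does not say how monitoring is told about an experiment. One plausible reading, sketched below in Python, is to emit a small event whenever an experiment starts or stops so that alerts can be correlated with the injected failure. The payload fields and the `chaos_event` helper are assumptions made for illustration; they are not the Sensu or Gremlin API, and a real integration would send the JSON to whatever endpoint the monitoring system provides.

```python
import json
import time
from typing import Dict, Optional

VALID_STATES = {"started", "halted", "finished"}


def chaos_event(
    experiment: str,
    target: str,
    state: str,
    now: Optional[float] = None,
) -> Dict[str, object]:
    """Build a hypothetical event describing a chaos experiment state change."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state}")
    return {
        "source": "chaos-experiment",
        "experiment": experiment,
        "target": target,  # the one service being failed
        "state": state,
        "timestamp": int(now if now is not None else time.time()),
    }


def to_json(event: Dict[str, object]) -> str:
    """Serialize the event with stable key order for logging or sending."""
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(to_json(chaos_event("cache-blackhole", "cache", "started")))
```

With events like this in the monitoring timeline, an alert that fires during an experiment can be recognised as expected rather than investigated as an unexplained outage.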

26. Let The Right People Know

29. Closing the Loop

32. Reliability Through Chaos Engineering: Graceful Failure. Design for Failure: identify the most critical end user functionality. Inject Failure: impact your system to be sure your user experience isn’t impacted. Degrade Gracefully: plan for non critical functionality not to get in the way. Delight Your Users: your product metrics will show behaviour, no matter the condition.

33. USE inthefamily FOR $50 OFF

34. gremlin.com/lorne

35. Q&A. Lorne Kligerman, Director of Product, Gremlin (@lklig). Aaron Sachs, Customer Reliability Engineer, Sensu (@asachs01).