OSMC 2022 | How we improved our monitoring so that everyone likes to be on-call by Daniel Uhlmann

Ever wonder why your Engineers don’t necessarily like being on call? There can be many different reasons for this, and one cause could be a poorly configured monitoring system. In this talk I would like to share with you the different stages we went through as a team to get from an inadequate monitoring to a solution that provides real value not only for the customer but also for us as a team.

  1. How we improved our monitoring so that everyone likes to be on-call
  2. What you can expect
     - why on-call can be disrespectful
     - a real-world on-call transformation example
     - that on-call can mean a lot of different things
     - observability
     - engineers who like to be on-call
  3. About me
     - Daniel Uhlmann, T-Systems Multimedia Solutions GmbH
     - passion for Linux and open source
     - Twitter: @xfuturecs
     - Blog: xfuture-blog.com
  4. What we are working on
     - we maintain several customer services and applications
     - our monitoring is very distributed, with various services and environments
     - meaning we need to context-switch and adapt quickly, a lot
  5. "Why should I take the on-call duty? I thought someone else would do this for us."
     "If you haven't debugged the live database system at 3:00 in the morning, you're not a real developer."
     "I didn't sign up for this."
     "I sacrificed so much sleep and lost my mental health being on-call. But this is okay because it is for my/our product."
  6. This is not acceptable, so what can we learn from this?
     - there are a lot of toxic patterns around being on-call
     - being on-call can be disrespectful
     - no sleep, impacting personal lives
     - flapping alerts will drive you crazy
     - maybe no training
     - if you don't take care, every check will alert you
  7. Where we came from: ...well, we had nearly the same problems:
     - a lot of false-positive checks
     - lack of detailed monitoring
     - sleepless nights and scared junior engineers with a resting pulse of 180 beats per minute
     - been there, done that
  8. But we managed to change it
  9. (no text on this slide)
  10. Keep in mind: the ultimate goal is not to never get notified again!
  11. Every check alarmed us
     - we set up a team appointment to figure out which checks are truly business-critical
     - implemented two "hotlines" to separate 24/7 and business-hour calls (see the routing sketch after the slide list)
     - this resulted in fewer calls during the night
  12. Our learnings
     - delete every check that gives you no meaningful information
     - not all checks are really business-critical
     - set the bar high for waking people up at 2 AM
  13. Lack of detailed monitoring
     - check more than just the end-to-end connection of your application
     - figuring out the business-critical components for your customers is a good first step
  14. Our learnings
     - think from the customer's perspective first
     - even better: talk with your customers about what is crucial for their business
  15. Missing experience of a real outage
     - most uncertainties arise from a lack of preparation
     - utilize the expertise of already experienced colleagues
     - new colleagues get an experienced backup colleague for their first on-call duties
     - simulate a real outage, à la chaos engineering
  16. Our learnings
     - remember to breathe
     - check whether the alert has linked documentation
     - the biggest obstacle is fear
  17. Chaos engineering
     - experiment on a distributed system to build confidence
     - discover new issues that could impact your services by injecting failures and errors (see the fault-injection sketch after the slide list)
  18. What is the difference between chaos engineering and failure testing?
  19. Test in production
     - don't over-invest in staging systems and under-invest in your production system
     - most bugs will only ever be found with enough user interactions
  20. Fix bugs at 2 pm, not at 2 am!
     - failure testing and chaos engineering can help you fix some of them
     - if you can't track down what's happening within a few minutes, you need better observability
  21. Measure your paging alerts
     - collect statistics for incoming calls, especially out-of-hours
     - track, graph and talk about your paging alerts (see the paging-statistics sketch after the slide list)
  22. Qualitative tracking
     - success is not about "not having incidents"
     - it's about how confident people feel while being on-call
  23. Ask your engineers
     - qualitative feedback plays an important role for success
     - it ensures that you are on the right track
  24. (no text on this slide)
  25. (no text on this slide)
  26. Predictive alarming
     - for example: checks that alarm you if the disk is slowly becoming too full (see the trend-extrapolation sketch after the slide list)
     - only alert if users have real pain
     - this reduced our alert frequency even more
  27. Assign a role to your monitoring...
     - to keep your monitoring clean
     - to create tickets for occurring events
     - to fix quick wins
     - to update your colleagues about the current state
  28. What happens when the on-call rotation changes
     - define a process for the transfer
     - clean up your monitoring
  29. Align engineering pain with user pain
     - migrate to SLO-based monitoring (see the burn-rate sketch after the slide list)
     - adopt alerting best practices
     - profit by tracking down your pain and paying it down
  30. Remember our initial situation?
  31. Thank you for listening!
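
A note on slide 11's two "hotlines": in practice this amounts to routing alerts by severity and time of day. The Python routing sketch below only illustrates that idea; the severity label, business-hours window and queue names are made-up placeholders, not the setup described in the talk.

```python
from datetime import datetime, time

# Hypothetical business-hours window (assumption for illustration)
BUSINESS_HOURS = (time(8, 0), time(18, 0))

def route_alert(severity, now):
    """Decide whether an alert pages the 24/7 on-call or waits for business hours."""
    if severity == "critical-24x7":          # hypothetical label for business-critical checks
        return "page-oncall"                  # truly critical: page at any time
    start, end = BUSINESS_HOURS
    if now.weekday() < 5 and start <= now.time() <= end:
        return "page-business-hours"          # handled by the daytime "hotline"
    return "queue-for-next-morning"           # everything else can wait

# A non-critical alert firing at 02:30 on a Tuesday is queued, not paged
print(route_alert("warning", datetime(2022, 11, 15, 2, 30)))
```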
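
For slide 17, a minimal, hypothetical fault-injection sketch in Python: it randomly adds errors and latency to a call so you can watch how alerting and dashboards react. Real chaos experiments (killed instances, network partitions, etc.) go far beyond this, and the error rate and latency values here are arbitrary.

```python
import functools
import random
import time

def inject_faults(error_rate=0.05, max_extra_latency=2.0):
    """Return a decorator that randomly injects failures and latency into a call."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise RuntimeError("chaos: injected failure")
            time.sleep(random.uniform(0.0, max_extra_latency))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorate

# Hypothetical service call wrapped for a chaos experiment
@inject_faults(error_rate=0.1, max_extra_latency=0.5)
def handle_request(payload):
    return {"status": "ok", "payload": payload}
```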
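
For slide 21, a paging-statistics sketch: given only the timestamps of incoming pages, it counts pages per ISO week and the share that arrived out of hours. The 08:00–18:00 weekday window is an assumption, not a value from the talk.

```python
from collections import Counter
from datetime import datetime

def paging_stats(pages, business_start=8, business_end=18):
    """pages: list of datetime objects, one per page/call-out."""
    per_week = Counter(ts.isocalendar()[:2] for ts in pages)  # (year, ISO week) -> pages
    out_of_hours = sum(
        1 for ts in pages
        if ts.weekday() >= 5 or not business_start <= ts.hour < business_end
    )
    share = out_of_hours / len(pages) if pages else 0.0
    return per_week, share

# Two example pages: one at 02:13 on a Monday, one at 10:45 on a Wednesday
pages = [datetime(2022, 11, 14, 2, 13), datetime(2022, 11, 16, 10, 45)]
per_week, share = paging_stats(pages)
print(dict(per_week), f"{share:.0%} of pages were out of hours")
```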
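
For slide 26, predictive disk alarms usually come down to a linear extrapolation over recent usage samples (PromQL offers predict_linear() for this, for example). The trend-extrapolation sketch below shows the idea in plain Python; the sample format and the 24-hour paging horizon are assumptions.

```python
from datetime import datetime, timedelta

def hours_until_full(samples, capacity_bytes):
    """Fit a linear trend through (timestamp, used_bytes) samples and
    extrapolate when the disk would be full. Returns None if usage is not growing."""
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() for ts, _ in samples]
    ys = [used for _, used in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom  # bytes/s
    if slope <= 0:
        return None
    return (capacity_bytes - ys[-1]) / slope / 3600

def should_page(samples, capacity_bytes, horizon_hours=24):
    """Page only if the disk is predicted to fill within the horizon."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta < horizon_hours

# Example: ~1 GiB growth per hour on a 100 GiB disk that is already 90 GiB full
now = datetime(2022, 11, 15, 12, 0)
samples = [(now + timedelta(hours=h), (90 + h) * 2**30) for h in range(4)]
print(should_page(samples, capacity_bytes=100 * 2**30))  # True: roughly 7 hours left
```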
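
For slide 29, SLO-based alerting commonly pages on the error-budget burn rate rather than on individual checks. The burn-rate sketch below is a minimal illustration; the 99.9% target and the 14.4 fast-burn threshold are common example values from public SRE guidance, not numbers from the talk.

```python
def burn_rate(bad_events, total_events, slo_target=0.999):
    """Burn rate of the error budget in the observed window.
    1.0 = budget lasts exactly the SLO period; much higher = users feel real pain."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget

def should_page(bad_events, total_events, slo_target=0.999, fast_burn_threshold=14.4):
    """Page only on a fast burn; slower burns can become tickets for business hours."""
    return burn_rate(bad_events, total_events, slo_target) >= fast_burn_threshold

# Example: 150 failed out of 10,000 requests in the last hour -> burn rate 15, page
print(burn_rate(150, 10_000), should_page(150, 10_000))
```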
