
Practical service level objectives with error budgeting

Talk given at BayLISA May 2019 on SLOs and Error Budgets

Published in: Software


  1. Practical Service Level Objectives With Error Budgeting. Fred Moyer (@phredmoyer), BayLISA, May 16, 2019
  2. Are errors important?
  3. Is latency important?
  4. How many errors in your app last week?
  5. How many requests over 500ms last week?
  6. Your error/request ratio last week?
  7. Are slow requests errors?
  8. Hi, I’m Fred ● @phredmoyer ● Monitoring nerd ● Writing code for 20 years ● And breaking prod ● Likes Go, Perl, C, Pg ● Likes SLOs ● Doesn’t like errors
  9. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  10. What is an Error Budget? Zero errors! Happy users!
  11. What is an Error Budget? Too much risk = too many errors; too many errors = unhappy users. Too little risk = no code shipped; no code shipped = unhappy users.
  14. What is an Error Budget? Too much risk = unhappy users. Just enough risk = happy users. Too little risk = unhappy users.
  15. What is an Error Budget? Error budget = acceptable risk. Acceptable risk = 100% - SLO. Error budget = 100% - SLO.
  16. SLOs, How Do They Work?
  17. SLOs, How Do They Work? "SLIs, SLOs, SLAs, oh my!" https://www.youtube.com/watch?v=tEylFyxbDLE (@lizthegrey ⇔ @sethvargo)
      SLI: 95th %ile of requests over 5 min < 300ms
      SLO: 95th %ile SLI for 1 month succeeds 99.9%
      SLA: 95th %ile SLI for 1 month succeeds 99.5%, or you have to refund money
  18. What is an Error Budget?
      SLI: 95th %ile of requests over 5 min < 300ms
      SLO: 95th %ile SLI for 1 month succeeds 99.9%
      1M requests in one month
      Error budget = (1 - 0.999) * 1M = 1k requests
      1k requests can exceed 300ms
  19. What is an Error Budget? Chapter 3, Embracing Risk
  20. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  21. Calculating Error Budgets with Logs - Latency
  22. Calculating Error Budgets with Logs - Latency
      Error budget = 100% - SLO = (1 - 0.999) * 1M = 1k
      Error budget = 1k requests/day > 300ms (with 1M requests/day)
      LogFormat "%h %l %u %O \"%{User-Agent}i\" %D"
      %D: request duration in microseconds (300ms = 300,000µs)
      For each request: if duration > SLI (300ms), error_budget++
  23. Calculating Error Budgets with Logs - Errors
  24. Calculating Error Budgets with Logs - Errors
      Error budget = 1k requests/day > 300ms
      [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server: /export/home/live/ap/htdocs/test
      For each error log entry, error_budget++
      If req duration > SLI (300ms), error_budget++
      Alert if error_budget/total_reqs > 80% * (1 - SLO)
  25. Calculating Error Budgets with Logs. Cumulative sum functionality required ● Splunk ● ELK ● mtail ○ https://github.com/google/mtail ● Honeycomb.io ● Circonus Logwatch ○ https://github.com/circonus-labs/circonus-logwatch
  26. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  27. Calculating Error Budgets with Metrics - Errors
  28. Calculating Error Budgets with Metrics
      Use a counter metric (uint32/uint64)
      Error budget = 1k requests/day > 300ms
      For each app error, error_budget++
      If req duration > SLI (300ms), error_budget++
      Alert if error_budget/total_reqs > 80% * (1 - SLO)
  29. Calculating Error Budgets with Metrics (and Logs). Problems: ● SLI has a fixed threshold ● Inability to introspect historical data ● Difficult to compare different SLI behavior
  30. Calculating Error Budgets with Metrics - Histograms. Use a histogram. Image source: http://www.brendangregg.com/FrequencyTrails/modes.html
  31. Calculating Error Budgets with Metrics - Histograms. Linear, cumulative, log-linear, approximate… High dynamic range, log-linear recommended. http://hdrhistogram.org/ https://github.com/circonus-labs/circonusllhist
  32. Calculating Error Budgets with Metrics - Histograms
      Error budget = 1k requests/day > X ms
      For each histogram bin >= X: error_budget += bin_sample_count
      Alert if error_budget/total_reqs > 80% * (1 - SLO)
  33. Calculating Error Budgets with Metrics - Histograms. Choose a bin boundary for the SLI (preferred) or interpolate within boundaries.
  34. Calculating Error Budgets with Metrics - Histograms. Error budget ~ 1k requests/day > 1,800µs
  35. Calculating Error Budgets with Metrics - Histograms. Error budget ~ 1k requests/day > 2,400µs
  36. Calculating Error Budgets with Metrics - Histograms. Benefits: ● SLI variable threshold ● Ability to analyze historical data ● Examine error budgets for different SLIs
  37. Talk Agenda ● SLOs and Error Budgets ● Calculating Error Budgets with Logs ● Calculating Error Budgets with Metrics
  38. Questions?
  39. Thanks! https://slideshare.net/redhotpenguin https://twitter.com/phredmoyer https://linkedin.com/in/redhotpenguin https://github.com/redhotpenguin
  40. Appendix - SLOs, How Do They Work? ● Chapter 4 ○ Service Level Objectives ● 99% of Get RPC calls < 100ms ● https://landing.google.com/sre/sre-book/toc/index.html
  41. Appendix - SLOs, How Do They Work? ● Ch 2: Implementing SLOs ● Ch 3: SLO Eng case studies ● Ch 5: Alerting on SLOs ● https://landing.google.com/sre/workbook/toc
  42. Appendix - SLOs, How Do They Work? ● Chapter 21 ○ The Art and Science of The Service Level Objective
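
The arithmetic on slides 15 and 18 and the alert rule on slides 24, 28 and 32 can be sketched in a few lines of Python. This is only an illustration: the function names and the `burn_fraction` parameter are hypothetical, and the alert rule is read here as 80% * (1 - SLO), which is an assumption about the intended operator precedence.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests allowed to miss the SLI: (1 - SLO) * total requests."""
    return round((1.0 - slo) * total_requests)


def should_alert(bad_requests: int, total_requests: int, slo: float,
                 burn_fraction: float = 0.8) -> bool:
    """True once the bad-request ratio exceeds burn_fraction * (1 - SLO)."""
    if total_requests == 0:
        return False  # no traffic, nothing to alert on
    return bad_requests / total_requests > burn_fraction * (1.0 - slo)


# Slide 18: a 99.9% SLO over 1M requests
budget = error_budget(0.999, 1_000_000)
```

With these inputs `budget` is 1,000 requests, and `should_alert(900, 1_000_000, 0.999)` fires because 900 bad requests is past 80% of that budget.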
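
The per-request loop on slide 22 might look like the following sketch for an access log. It assumes the duration is the last whitespace-separated field on each line, as in the slide's format string, and that it is in microseconds, which is how Apache httpd documents `%D`. The sample lines and the function name are made up for illustration.

```python
SLI_THRESHOLD_US = 300_000  # 300 ms expressed in microseconds


def count_slow_requests(lines, threshold_us=SLI_THRESHOLD_US):
    """Return (slow, total) for log lines whose last field is a duration in µs."""
    slow = total = 0
    for line in lines:
        fields = line.rsplit(None, 1)  # split off the trailing duration field
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # skip blank or malformed lines
        total += 1
        if int(fields[1]) > threshold_us:
            slow += 1  # this request consumes error budget
    return slow, total


# Hypothetical sample lines: host, ident, user, bytes, user agent, duration
sample_log = [
    '127.0.0.1 - - 5120 "curl/7.64.0" 120000',
    '127.0.0.1 - - 812 "Mozilla/5.0" 412345',
    '10.0.0.7 - - 2048 "curl/7.64.0" 299999',
    '',
]
slow, total = count_slow_requests(sample_log)
```

In a real deployment the counts would be accumulated over the SLO window, which is why slide 25 lists tools with cumulative sum support.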
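
Slides 32 and 33 describe summing histogram bins at or above the threshold, with interpolation when the threshold falls inside a bin. A minimal sketch follows, assuming bins are stored as (lower, upper, count) tuples and that samples are spread uniformly within a bin. The bin layout and counts are invented for illustration and are linear rather than log-linear; this is not the API of HdrHistogram or circonusllhist.

```python
def requests_over(bins, threshold):
    """Estimate how many samples exceed `threshold` from (lower, upper, count) bins."""
    over = 0.0
    for lower, upper, count in bins:
        if lower >= threshold:
            over += count  # whole bin is above the threshold
        elif upper > threshold:
            # Threshold falls inside this bin: take the upper share of it,
            # assuming samples are uniformly distributed within the bin.
            over += count * (upper - threshold) / (upper - lower)
    return over


# Hypothetical latency histogram in microseconds
latency_bins = [
    (0, 1000, 900),
    (1000, 2000, 80),
    (2000, 3000, 15),
    (3000, 4000, 5),
]
over_at_boundary = requests_over(latency_bins, 2000)   # threshold on a bin boundary
over_interpolated = requests_over(latency_bins, 2400)  # threshold inside a bin
```

Because the histogram keeps the whole distribution, the same stored data can be re-queried with a different threshold (for example 1,800µs instead of 2,400µs), which is the benefit slide 36 points to.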
