
The Dark Art of Production Alerting


Published in: Technology, Design

  1. The Dark Art of Building a Production Incident System. @AloisReitbauer, www.ruxit.com
  2. No broken cables
  3. No datacenter fires
  4. Other things can happen as well: continuous deployments, infrastructure changes, other “everyday” stuff
  5. Scaling an incident system
  6. How it feels to do what we do
  7. Do you alert? Typical error rate of 3 percent at 10,000 transactions/min. During the night we now have 5 errors in 100 requests.
  8. Do you alert? Typical response time has been around 300 ms. Now we see response times up to 600 ms.
  9. We are good at fixing problems, but not really good at detecting them.
  10. How can we get better?
  11. It’s all about statistics
  12. Statistics is about objectively lying to yourself in a meaningful way.
  13. How to design an incident
  14. How to calculate this value? It looks really simple. Which metric to pick? How to get this baseline? How to define that this happened?
  15. Which metrics to pick?
  16. Three types of metrics. Capacity metrics: define how much of a resource is used. Discrete metrics: simple countable things, like errors or users. Continuous metrics: metrics represented by a range of values at any given time.
  17. Capacity Metrics: good for capacity planning, not so good for production alerting
  18. Connection Pools
  19. Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.
  20. CPU Usage
  21. Better use: a combination of load average and CPU usage. Even better: correlate them with the response times of applications.
  22. Discrete Metrics: pretty easy to track and analyze.
  23. Continuous Metrics: require some extra work, as they are not that easy to track.
  24. Continuous Metrics – The hope: 42
  25. Continuous Metrics – The reality
  26. What the average tells us
  27. What the median tells us
  28. How to get a baseline?
  29. A baseline is not a number. Baselines define the range of a value combined with a probability.
  30. Normal distribution as baseline. Mean: 500 ms, std. dev.: 100 ms. 68 % of values: 400 ms – 600 ms; 95 %: 300 ms – 700 ms; 99.7 %: 200 ms – 800 ms.
  31. This can go really wrong: “Why alerts suck and monitoring solutions need to become better”
  32. How this leads to false alerts
  33. Aggressive baseline: many false alerts
  34. Moderate baseline: no alerts at all
  35. Find the right distribution model. However, this can be really hard to impossible.
  36. Your distribution might look like this
  37. … or like this
  38. … or completely different. You never know …
  39. How can we solve this problem?
  40. Normal distribution, again: 50 percent of values lie below μ (the median); about 97.7 percent lie below μ + 2σ (roughly the 97th percentile).
  41. The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model.
  42. Median shows the real problem
  43. How to define non-normal behavior?
  44. Fortunately, this is not the problem we need to solve. We are only talking about missed expectations.
  45. Let’s look at two scenarios. Errors: is a certain error rate likely to happen or not? Response times: is a certain increase in response time significant enough to trigger an incident?
  46. The error rate scenario: we have a typical error rate of 3 percent at 10,000 transactions/minute. During the night we now have 5 errors in 100 requests. Should we alert, or not?
  47. What can we learn
  48. Statistics is everywhere
  49. Binomial Distribution: tells us how likely it is to see n successes in a certain number of trials.
  50. How many errors are ok? [Chart: likelihood of at least n errors, n = 1 to 19.] There is an 18 % probability of seeing 5 or more errors, which is within two standard deviations. We do not alert.
  51. Response Time Example: our median response time is 300 ms and we measure 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms.
  52. Percentile Drift Detection
  53. Did the median drift significantly? Check all values above 300 ms: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms. 7 values are higher than the median. Is this normal? We can again use the Binomial Distribution.
  54. Applying the Binomial Distribution: we have a 50 percent likelihood of seeing a value above the median. How likely is it that at least 7 out of 10 samples are higher? The probability is 17 percent, so we should not alert.
  55. How to calculate this value? … and we are done! Which metric to pick? How to get this baseline? How to define that this happened?
  56. This was just the beginning. There are many more useful things about statistics, probabilities, testing, …
  57. Alois Reitbauer, alois.reitbauer@ruxit.com, @AloisReitbauer, http://bit.ly/bostonwebperf
  58. Image credits: http://commons.wikimedia.org/wiki/File:Network_switches.jpg, http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg, http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg, http://commons.wikimedia.org/wiki/File:Estacaobras.jpg, http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg, http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG, http://commons.wikimedia.org/wiki/File:Dice_02138.JPG, http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
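
Slides 29 to 41 contrast a baseline built on a normal distribution with one built on percentiles. The following is a minimal Python sketch of that contrast, not the speaker's implementation. The helper names `normal_baseline` and `percentile_baseline` are hypothetical, and the long-tailed sample values are made up for illustration; only the mean of 500 ms and the standard deviation of 100 ms come from slide 30.

```python
from statistics import NormalDist, mean, quantiles, stdev


def normal_baseline(mu, sigma, k):
    """Return the range mu +/- k*sigma and the probability a normal model puts inside it."""
    dist = NormalDist(mu, sigma)
    low, high = mu - k * sigma, mu + k * sigma
    return low, high, dist.cdf(high) - dist.cdf(low)


def percentile_baseline(samples):
    """Return the median and 90th percentile straight from the data, with no model assumed."""
    deciles = quantiles(samples, n=10, method="inclusive")
    return deciles[4], deciles[8]


# Slide 30: mean 500 ms, standard deviation 100 ms.
for k in (1, 2, 3):
    low, high, prob = normal_baseline(500, 100, k)
    print(f"{low:.0f} ms to {high:.0f} ms covers {prob:.1%} of values")

# Assumption: made-up response times in ms with a long tail.
skewed_ms = [180, 200, 210, 220, 240, 250, 260, 280, 300, 320,
             350, 380, 420, 500, 650, 900, 1500, 2400, 3800, 6000]

mu, sigma = mean(skewed_ms), stdev(skewed_ms)
p50, p90 = percentile_baseline(skewed_ms)

# On skewed data the normal model yields a negative lower bound, which is meaningless
# for response times, while the percentiles still describe the data.
print(f"mean - 2 sigma = {mu - 2 * sigma:.0f} ms, mean + 2 sigma = {mu + 2 * sigma:.0f} ms")
print(f"median = {p50:.0f} ms, 90th percentile = {p90:.0f} ms")
```

The percentile version needs no distribution model at all, which is the point slide 41 makes.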

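
Both scenarios in slides 46 to 54 reduce to the same question: how likely is it to see at least k "hits" in n trials when each hit has probability p? The sketch below computes that binomial tail probability in Python using only the standard library. The function names and the 5 percent alert threshold are assumptions made for illustration; the slides give only the resulting probabilities and the decision not to alert.

```python
from math import comb


def binomial_tail(n, p, k):
    """Probability of at least k successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: alert only if the observation is less than 5 % likely under normal behavior.
ALERT_THRESHOLD = 0.05


def should_alert(n, p, k, threshold=ALERT_THRESHOLD):
    """Alert when seeing k or more hits in n trials is too unlikely to be chance."""
    return binomial_tail(n, p, k) < threshold


# Error rate scenario: 3 percent typical error rate, 5 errors observed in 100 requests.
error_tail = binomial_tail(100, 0.03, 5)
print(f"P(at least 5 errors in 100 requests) = {error_tail:.1%}")
print("alert" if should_alert(100, 0.03, 5) else "do not alert")

# Median drift scenario: count the samples above the 300 ms median.
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above = sum(1 for s in samples_ms if s > 300)
drift_tail = binomial_tail(len(samples_ms), 0.5, above)
print(f"P(at least {above} of {len(samples_ms)} samples above the median) = {drift_tail:.1%}")
print("alert" if should_alert(len(samples_ms), 0.5, above) else "do not alert")
```

With these inputs the two tail probabilities come out at roughly 18 percent and 17 percent, matching the figures on slides 50 and 54, so neither case triggers an alert under the assumed threshold.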