The document discusses building an effective production incident system using statistics. It explains that using the median and percentiles to define a baseline range captures normal system behavior better than trying to fit a specific distribution model. Two examples are provided: 1) Using the binomial distribution to determine if an error rate exceeds expectations. 2) Using percentiles to check if response times have drifted above the median without knowing the underlying distribution. The key is applying statistical methods to objectively determine what constitutes a normal range of values versus a problem requiring alerting.
16. Three types of metrics
Capacity Metrics
Define how much of a resource is used.
Discrete Metrics
Simple countable things, like errors or users.
Continuous Metrics
Metrics represented by a range of values at any
given time.
29. A baseline is not a number
Baselines define the range of a value combined
with a probability
30. Normal distribution as baseline
Mean: 500 ms
Std. Dev.: 100 ms
68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms
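The ranges on this slide follow from the normal distribution's cumulative distribution function. A minimal sketch, assuming the baseline really is normal with the mean and standard deviation shown above; the helper name `coverage` is hypothetical and only used for illustration:

```python
from statistics import NormalDist

# Values from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)

def coverage(dist, k):
    """Return (low, high, probability) for the range mean +/- k standard deviations."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    # Probability mass between low and high under the assumed normal model.
    return low, high, dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high, prob = coverage(baseline, k)
    print(f"{low:.0f} ms - {high:.0f} ms: {prob:.1%}")
```

One, two and three standard deviations give roughly 68 %, 95 % and 99.7 % coverage. These figures only hold if the data is actually normally distributed.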
31. This can go really wrong
“Why alerts suck and monitoring solutions need to become better”
44. Fortunately this is not the problem we need to solve
We are only talking about missed expectations
45. Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?
46. The error rate scenario
We have a typical error rate of 3 percent at
10,000 transactions/minute
During the night we now have 5 errors in 100
requests. Should we alert – or not?
49. Binomial Distribution
Tells us how likely it is to see n successes in a
certain number of trials
50. How many errors are ok?
[Chart: likelihood of at least n errors, for n = 1 to 19]
There is an 18 % probability of seeing 5 or more errors, which is within two standard deviations of the expected value. We do not alert.
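The 18 % figure can be reproduced with the binomial tail probability. This is a sketch using only the numbers from the slides (3 percent typical error rate, 5 errors in 100 requests). The function name `prob_at_least` and the 5 % cut-off `ALERT_THRESHOLD` are assumptions made for illustration; the slide itself argues via two standard deviations.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X < k)."""
    below = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return 1.0 - below

# Values from the slides: typical error rate 3 %, 5 errors in 100 requests.
p_value = prob_at_least(5, n=100, p=0.03)
print(f"P(at least 5 errors in 100 requests) = {p_value:.1%}")

# Assumed cut-off for illustration only; pick one that fits your alerting policy.
ALERT_THRESHOLD = 0.05
print("alert" if p_value < ALERT_THRESHOLD else "do not alert")
```

With these inputs the probability is well above the assumed threshold, which matches the slide's conclusion not to alert.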
51. Response Time Example
Our median response time is 300 ms and we measure:
200 ms, 500 ms, 400 ms, 150 ms, 350 ms, 350 ms, 200 ms, 400 ms, 600 ms, 600 ms
53. Did the median drift significantly?
Check all values above 300 ms:
500 ms, 400 ms, 350 ms, 350 ms, 400 ms, 600 ms, 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution
54. Applying the Binomial Distribution
By definition, each value has a 50 percent likelihood of being above the median.
How likely is it that 7 or more out of 10 samples are higher?
The probability is about 17 percent, so we should not alert.
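The check described here is essentially a one-sided sign test: count the samples above the baseline median and ask how likely that count (or a larger one) is when each sample has a 50 percent chance of being above. A minimal sketch using the ten measurements from the slides; the function name `sign_test_upper` is hypothetical, and values equal to the median are simply counted as "not above", which is a simplification.

```python
from math import comb

def sign_test_upper(samples, baseline_median):
    """Return (count above median, P(at least that many above) for p = 0.5)."""
    n = len(samples)
    above = sum(1 for s in samples if s > baseline_median)
    # With p = 0.5 every outcome has probability 1 / 2**n, so the tail
    # probability is the number of favourable outcomes divided by 2**n.
    tail = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
    return above, tail

# Measurements from the slides, in milliseconds; baseline median is 300 ms.
measurements_ms = [200, 500, 400, 150, 350, 350, 200, 400, 600, 600]
above, p_value = sign_test_upper(measurements_ms, baseline_median=300)
print(f"{above} of {len(measurements_ms)} above the median, p = {p_value:.1%}")
```

No assumption about the shape of the response time distribution is needed, only the baseline median.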
55. … and we are done!
How to calculate this value?
Which metric to pick?
How to get this baseline?
How to define that this happened?
56. This was just the beginning
There are many more useful things to learn about statistics,
probabilities, testing, …