The Dark Art of Production Alerting

The Dark Art of Building a Production Incident System
@AloisReitbauer
www.ruxit.com

Other things can happen
as well
Continuous deployments
Infrastructure changes
other “everyday” stuff

Do you alert?
Typical error rate of 3 percent at
10,000 transactions/min
During the night we now have 5
errors in 100 requests.

Do you alert?
Typical response time has been
around 300 ms.
Now we see response times up to
600 ms.

We are good at fixing
problems, but not really
good at detecting them.

It’s all about statistics

Statistics is about
objectively lying to yourself
in a meaningful way.

How to calculate
this value?
It looks really simple
Which metric
to pick?
How to get
this baseline?
How to define that
this happened?

Three types of metrics
Capacity Metrics
Define how much of a resource is used.
Discrete Metrics
Simple countable things, like errors or users.
Continuous Metrics
Metrics represented by a range of values at any given time.

Capacity Metrics
Good for capacity planning, not so good for
production alerting

better use
Connection acquisition time
Tells you whether anyone needed a connection
and did not get it.
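
As a rough sketch of how acquisition time could be measured, the snippet below times how long a caller waits for a connection. The pool is a plain `queue.Queue` standing in for a real connection pool, and `timed_acquire` is a hypothetical helper introduced only for illustration.

```python
import queue
import time

def timed_acquire(pool, timeout):
    """Return (connection or None, seconds spent waiting for it)."""
    start = time.perf_counter()
    try:
        conn = pool.get(timeout=timeout)
    except queue.Empty:
        conn = None  # someone needed a connection and did not get it
    return conn, time.perf_counter() - start

# Stand-in pool holding a single "connection" (assumption for illustration).
pool = queue.Queue()
pool.put("conn-1")

first, waited_first = timed_acquire(pool, timeout=0.05)
second, waited_second = timed_acquire(pool, timeout=0.05)
print(first, f"{waited_first:.4f}s")
print(second, f"{waited_second:.4f}s")
```

The second call finds the pool empty and waits for the full timeout, which is exactly the signal that pool utilization alone does not show.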

better use
Combination of Load Average and CPU usage
Even better: correlate them with the response times of
applications

Discrete Metrics
Pretty easy to track and analyze.

Continuous Metrics
Require some extra work as they are not that
easy to track.

Continuous Metrics – The hope
42

Continuous Metrics – The reality

A baseline is not a number
Baselines define the range of a value combined
with a probability

Normal distribution as baseline
Mean: 500 ms
Std. Dev.: 100 ms
68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms

This can go really wrong
“Why alerts suck and monitoring solutions need to become better”

How this leads to false
alerts

Many false alerts
Aggressive Baseline

No alerts at all
Moderate Baseline

Find the right distribution model
However, this can be really hard or even impossible

Your distribution might look like this

or completely different
you never know …

How can we solve this
problem?

Normal distribution - again
50 percent of values are below μ (the median)
About 97.7 percent of values are below μ + 2σ (roughly the 97th percentile)

The 50th and 90th percentile
define normal behavior
without needing
to know anything about the
distribution model
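
A minimal sketch of this idea: percentiles can be read directly from sorted samples, with no assumption about the shape of the distribution. The nearest-rank method used here is one of several common percentile definitions, and both the `percentile` helper and the sample values are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p % of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in ms (not from the talk).
response_times = [120, 180, 200, 250, 300, 310, 350, 420, 600, 900]

p50 = percentile(response_times, 50)
p90 = percentile(response_times, 90)
print(f"median: {p50} ms, 90th percentile: {p90} ms")
```

Note how the single 900 ms outlier does not move either value, which is why percentiles make a more robust baseline than mean and standard deviation.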

How to define non-normal behavior?

Fortunately, this is not the problem we
need to solve
We are only talking about missed expectations

Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?

The error rate scenario
We have a typical error rate of 3 percent at
10,000 transactions/minute
During the night we now have 5 errors in 100
requests. Should we alert – or not?

Binomial Distribution
Tells us how likely it is to see n successes in a
certain number of trials

How many errors are ok?
[Chart: likeliness of at least n errors, for n = 1 to 19]
18 % probability to see 5 or more errors, which is within 2 standard
deviations. We do not alert.
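
The 18 % figure can be checked with a short sketch of the binomial tail probability, assuming each request fails independently with the typical 3 percent error rate. `prob_at_least` is a hypothetical helper name.

```python
from math import comb, sqrt

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n, p, observed = 100, 0.03, 5           # numbers from the slide
tail = prob_at_least(observed, n, p)    # chance of 5 or more errors

mean = n * p                            # expected number of errors
std = sqrt(n * p * (1 - p))             # binomial standard deviation
within_two_sigma = observed <= mean + 2 * std

print(f"P(at least {observed} errors) = {tail:.1%}")
print(f"within 2 standard deviations: {within_two_sigma}")
```

With an expected 3 errors and a standard deviation of about 1.7, five errors in 100 requests is still unremarkable.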

Response Time Example
Our median response time is 300 ms
and we measure
200 ms 400 ms 350 ms 200 ms 600 ms
500 ms 150 ms 350 ms 400 ms 600 ms

Did the median drift
significantly?
Check all values above 300 ms
200 ms 400 ms 350 ms 200 ms 600 ms
500 ms 150 ms 350 ms 400 ms 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution

Applying the Binomial Distribution
We have a 50 percent likeliness to see values above the
median.
How likely is it that at least 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
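
The same check can be sketched for the response time samples. This is essentially a sign test against the baseline median; the 5 percent alert threshold is an assumption chosen for illustration, not a value from the talk.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

baseline_median = 300  # ms, from the slide
samples = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]  # ms, from the slide

above = sum(1 for s in samples if s > baseline_median)
# If the median had not moved, each sample would be above it with probability 0.5.
tail = prob_at_least(above, len(samples), 0.5)

ALERT_THRESHOLD = 0.05  # assumed significance level
should_alert = tail < ALERT_THRESHOLD

print(f"{above} of {len(samples)} samples above the median, probability {tail:.1%}")
print(f"alert: {should_alert}")
```

Seeing 7 or more of 10 samples above an unchanged median happens about 17 percent of the time by chance, so the drift is not significant yet.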

How to calculate
this value?
… and we are done!
Which metric
to pick?
How to get
this baseline?
How to define that
this happened?

This was just the beginning
There are many more useful things to learn about statistics,
probabilities, testing, …

Alois Reitbauer
alois.reitbauer@ruxit.com
@AloisReitbauer
http://bit.ly/bostonwebperf

Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
