Nobody likes false negatives. When your Nagios probes fail to detect a problem, it can hurt your sales, your reputation, and even your ego (especially your ego). The solution: tune the thresholds. Right? You can handle a couple of spurious late-night pages if it means you’ll reliably detect real failures.
I will argue that – while easy – exchanging false negatives for false positives does more harm than good. Borrowing the medical concepts of specificity and sensitivity, I’ll show how deceptive this tradeoff can be. I’ll also make the case that putting in the extra effort to minimize both types of falsehoods is necessary and healthy. When the alarm goes off, you shouldn’t have to spend precious minutes sniffing for smoke.
2. Who’s this punk?
• Dan Slimmon
• @danslimmon on the Twitters
• Senior Platform Engineer at Exosite
• Previously Operations Team Manager at Blue State Digital
4. Learn to do some stats and visualization.
You’ll be right much more often, & people will THINK you’re right even more often than that!
8. A word problem
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• Jerkwad kids plagiarize 30% of the time
9. Question 1
Given a random paper, what’s the probability
that you’ll get a negative result?
• Plagiarism: 90% chance of positive
• No Plagiarism: 20% chance of positive
• 30% chance of plagiarism
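The answer slide isn’t part of this extract, but the numbers above are enough to work it out with the law of total probability. A quick sketch:

```python
# Law of total probability:
#   P(negative) = P(neg | plagiarized) * P(plagiarized)
#               + P(neg | clean) * P(clean)
p_plag = 0.30             # 30% of papers are plagiarized
p_pos_given_plag = 0.90   # positive rate on plagiarized papers
p_pos_given_clean = 0.20  # positive rate on clean papers

p_negative = (1 - p_pos_given_plag) * p_plag + (1 - p_pos_given_clean) * (1 - p_plag)
print(f"P(negative) = {p_negative:.0%}")  # 59%
```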
11. Question 2
If there’s plagiarism, what’s the probability
you’ll detect it?
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• 30% chance of plagiarism
12. Question 3
If you get a positive result, what’s the
probability that the paper is plagiarized?
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• 30% chance of plagiarism
22. Question 3
If you get a positive result, what’s the
probability that the paper was plagiarized?
(Dark Green) / [(Dark Blue) + (Dark Green)]
23. Question 3
If you get a positive result, what’s the
probability that the paper was plagiarized?
27 / (14 + 27)
24. Question 3
If you get a positive result, what’s the
probability that the paper was plagiarized?
65.8%
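The Bayes computation in slides 22–24 is easy to check in a few lines:

```python
# Bayes' rule: P(plagiarized | positive)
#   = P(pos | plag) * P(plag)
#     / [P(pos | plag) * P(plag) + P(pos | clean) * P(clean)]
p_plag = 0.30
p_pos_given_plag = 0.90
p_pos_given_clean = 0.20

true_pos = p_pos_given_plag * p_plag           # 0.27, the "dark green" region
false_pos = p_pos_given_clean * (1 - p_plag)   # 0.14, the "dark blue" region
ppv = true_pos / (true_pos + false_pos)        # 0.27 / 0.41 ≈ 0.6585
print(ppv)
```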
27. Sensitivity & Specificity
Sensitivity: probability that, if a paper is plagiarized, you’ll get a positive. (90%)
Specificity: probability that, if a paper isn’t plagiarized, you’ll get a negative. (80%)
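In code (the function names are mine, and the counts are the per-100-papers figures implied by the numbers above: 30 plagiarized papers of which 27 are flagged, 70 clean papers of which 14 are flagged):

```python
def sensitivity(true_pos, false_neg):
    """P(positive | condition present): true positives over all actual positives."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """P(negative | condition absent): true negatives over all actual negatives."""
    return true_neg / (true_neg + false_pos)

# Per 100 papers: 30 plagiarized (27 flagged, 3 missed), 70 clean (14 flagged, 56 passed)
print(sensitivity(27, 3))   # 0.9
print(specificity(56, 14))  # 0.8
```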
36. The true-positive probability
Let’s calculate the probability that any given probe run will produce a true positive.
P(TP) = (prob. of service failure) * (sensitivity)
P(TP) = 0.1% * 99%
P(TP) = 0.099%
38. The false-positive probability
P(FP) = (prob. working) * (100% - specificity)
P(FP) = 99.9% * 1%
P(FP) = 0.999%
So roughly 1 in every 100 checks will be a false positive.
40. Positive predictive value
PPV = P(TP) / [P(TP) + P(FP)]
PPV = 0.099% / (0.099% + 0.999%)
PPV = 9.0%
If you get a positive, there’s less than a 1 in 10 chance that something’s actually wrong.
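Putting slides 36–40 together: even a check that is 99% sensitive and 99% specific has a dismal positive predictive value when real failures are rare (a 0.1% base rate here). A sketch of the whole calculation:

```python
p_fail = 0.001   # probability any given probe run hits a real failure
sens = 0.99      # P(alert | failure)
spec = 0.99      # P(no alert | no failure)

p_tp = p_fail * sens               # 0.00099: the 0.099% true-positive probability
p_fp = (1 - p_fail) * (1 - spec)   # 0.00999: just under 1% false-positive probability
ppv = p_tp / (p_tp + p_fp)
print(f"PPV = {ppv:.1%}")          # about 9%: roughly 1 in 11 alerts is real
```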
59. Semi-Practical Advice
• Check Apache process count
• Check swap usage
• Check median HTTP response time
• Check requests/second
60. “Your alerting should tell you whether work is getting done.”
Baron Schwartz (paraphrased)
62. Semi-Practical Advice
• Check Apache process count
• Check swap usage
• Check median HTTP response time & requests/second
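The last bullet is the interesting one: response time alone and request rate alone each have benign explanations, but taken together they tell you whether work is getting done. A sketch of what such a combined check might look like (the thresholds, function name, and input format are hypothetical illustrations, not from the talk):

```python
import statistics

def check_work_is_getting_done(response_times_ms, window_seconds,
                               max_median_ms=500, min_rps=1.0):
    """Alert only when the service is slow AND handling real traffic.

    A slow median with near-zero traffic is a different problem (and a
    different page) than a slow median under load. Thresholds here are
    illustrative, not prescriptive.
    """
    rps = len(response_times_ms) / window_seconds
    if rps < min_rps:
        return "CRITICAL: work is not getting done (traffic has vanished)"
    median_ms = statistics.median(response_times_ms)
    if median_ms > max_median_ms:
        return f"WARNING: median response time {median_ms:.0f}ms at {rps:.1f} req/s"
    return "OK"

print(check_work_is_getting_done([120, 95, 210, 180], window_seconds=2))  # OK
```

Combining the two signals into one check is what raises the positive predictive value: each signal vouches for the other, so fewer benign blips page you.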
63. A Pony I Want
Something like Nagios, but which
• Helps you separate detection from diagnosis
• Is SNR-aware
64. Other useful stuff
• Medical paper with a nice visualization: http://tinyurl.com/specsens
• Blog post with some algebra: http://tinyurl.com/carsmoke
• Base rate fallacy: http://tinyurl.com/brfallacy
• Bischeck: http://tinyurl.com/bischeck