Nobody likes false negatives. When your Nagios probes fail to detect a problem, it can hurt your sales, your reputation, and even your ego (especially your ego). The solution: tune the thresholds. Right? You can handle a couple of spurious late-night pages if it means you’ll reliably detect real failures.
I will argue that – while easy – exchanging false negatives for false positives does more harm than good. Borrowing the medical concepts of specificity and sensitivity, I’ll show how deceptive this tradeoff can be. I’ll also make the case that putting in the extra effort to minimize both types of falsehoods is necessary and healthy. When the alarm goes off, you shouldn’t have to spend precious minutes sniffing for smoke.
2. Who’s this punk?
• Dan Slimmon
• @danslimmon on the Twitters
• Senior Platform Engineer at Exosite
• Previously Operations Team Manager at Blue State Digital
4. Learn to do some stats and visualization.
You’ll be right much more often, & people will THINK you’re right even more often than that!
8. A word problem
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• Jerkwad kids plagiarize 30% of the time
9. Question 1
Given a random paper, what’s the probability
that you’ll get a negative result?
• Plagiarism: 90% chance of positive
• No Plagiarism: 20% chance of positive
• 30% chance of plagiarism
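The answer slide isn’t part of this extract, but the numbers above are enough to work it out with the law of total probability. A quick sketch:

```python
# Law of total probability:
#   P(negative) = P(neg | plagiarized) * P(plagiarized)
#               + P(neg | clean) * P(clean)
p_plag = 0.30             # 30% of papers are plagiarized
p_pos_given_plag = 0.90   # positive rate on plagiarized papers
p_pos_given_clean = 0.20  # positive rate on clean papers

p_negative = (1 - p_pos_given_plag) * p_plag + (1 - p_pos_given_clean) * (1 - p_plag)
print(f"P(negative) = {p_negative:.0%}")  # 59%
```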
11. Question 2
If there’s plagiarism, what’s the probability
you’ll detect it?
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• 30% chance of plagiarism
12. Question 3
If you get a positive result, what’s the
probability that the paper is plagiarized?
• Plagiarism: 90% chance of positive
• No plagiarism: 20% chance of positive
• 30% chance of plagiarism
22. Question 3
If you get a positive result, what’s the
probability that the paper was plagiarized?
(Dark Green) / [(Dark Blue) + (Dark Green)]
23. Question 3
If you get a positive result, what’s the
probability that the paper was plagiarized?
27 / (14 + 27)
24. Question 3
If you get a positive result, what’s the
probability that the paper was plagiarized?
65.8%
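The Bayes computation in slides 22–24 is easy to check in a few lines:

```python
# Bayes' rule: P(plagiarized | positive)
#   = P(pos | plag) * P(plag)
#     / [P(pos | plag) * P(plag) + P(pos | clean) * P(clean)]
p_plag = 0.30
p_pos_given_plag = 0.90
p_pos_given_clean = 0.20

true_pos = p_pos_given_plag * p_plag           # 0.27, the "dark green" region
false_pos = p_pos_given_clean * (1 - p_plag)   # 0.14, the "dark blue" region
ppv = true_pos / (true_pos + false_pos)        # 0.27 / 0.41 ≈ 0.6585
print(ppv)
```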
27. Sensitivity & Specificity
Sensitivity: probability that, if a paper is plagiarized, you’ll get a positive. (90%)
Specificity: probability that, if a paper isn’t plagiarized, you’ll get a negative. (80%)
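In code (the function names are mine, and the counts are the per-100-papers figures implied by the numbers above: 30 plagiarized papers of which 27 are flagged, 70 clean papers of which 14 are flagged):

```python
def sensitivity(true_pos, false_neg):
    """P(positive | condition present): true positives over all actual positives."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """P(negative | condition absent): true negatives over all actual negatives."""
    return true_neg / (true_neg + false_pos)

# Per 100 papers: 30 plagiarized (27 flagged, 3 missed), 70 clean (14 flagged, 56 passed)
print(sensitivity(27, 3))   # 0.9
print(specificity(56, 14))  # 0.8
```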
36. The true-positive probability
Let’s calculate the probability that any given probe run will produce a true positive.
P(TP) = (prob. of service failure) * (sensitivity)
P(TP) = 0.1% * 99%
P(TP) = 0.099%
38. The false-positive probability
P(FP) = (prob. working) * (100% - specificity)
P(FP) = 99.9% * 1%
P(FP) = 0.999%
So roughly 1 in every 100 checks will be a false positive.
40. Positive predictive value
PPV = P(TP) / [P(TP) + P(FP)]
PPV = 0.099% / (0.099% + 0.999%)
PPV = 9.0%
If you get a positive, there’s less than a 1 in 10 chance that something’s actually wrong.
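Putting slides 36–40 together: even a check that is 99% sensitive and 99% specific has a dismal positive predictive value when real failures are rare (a 0.1% base rate here). A sketch of the whole calculation:

```python
p_fail = 0.001   # probability any given probe run hits a real failure
sens = 0.99      # P(alert | failure)
spec = 0.99      # P(no alert | no failure)

p_tp = p_fail * sens               # 0.00099: the 0.099% true-positive probability
p_fp = (1 - p_fail) * (1 - spec)   # 0.00999: just under 1% false-positive probability
ppv = p_tp / (p_tp + p_fp)
print(f"PPV = {ppv:.1%}")          # about 9%: roughly 1 in 11 alerts is real
```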
59. Semi-Practical Advice
• Check Apache process count
• Check swap usage
• Check median HTTP response time
• Check requests/second
60. “Your alerting should tell you whether work is getting done.”
Baron Schwartz (paraphrased)
62. Semi-Practical Advice
• Check Apache process count
• Check swap usage
• Check median HTTP response time & requests/second
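The last bullet is the interesting one: response time alone and request rate alone each have benign explanations, but taken together they tell you whether work is getting done. A sketch of what such a combined check might look like (the thresholds, function name, and input format are hypothetical illustrations, not from the talk):

```python
import statistics

def check_work_is_getting_done(response_times_ms, window_seconds,
                               max_median_ms=500, min_rps=1.0):
    """Alert only when the service is slow AND handling real traffic.

    A slow median with near-zero traffic is a different problem (and a
    different page) than a slow median under load. Thresholds here are
    illustrative, not prescriptive.
    """
    rps = len(response_times_ms) / window_seconds
    if rps < min_rps:
        return "CRITICAL: work is not getting done (traffic has vanished)"
    median_ms = statistics.median(response_times_ms)
    if median_ms > max_median_ms:
        return f"WARNING: median response time {median_ms:.0f}ms at {rps:.1f} req/s"
    return "OK"

print(check_work_is_getting_done([120, 95, 210, 180], window_seconds=2))  # OK
```

Combining the two signals into one check is what raises the positive predictive value: each signal vouches for the other, so fewer benign blips page you.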
63. A Pony I Want
Something like Nagios, but which
• Helps you separate detection from diagnosis
• Is SNR-aware
64. Other useful stuff
• Medical paper with a nice visualization: http://tinyurl.com/specsens
• Blog post with some algebra: http://tinyurl.com/carsmoke
• Base rate fallacy: http://tinyurl.com/brfallacy
• Bischeck: http://tinyurl.com/bischeck