More than five years ago, Rob Ewaschuk created an innocuous Google doc titled “My Philosophy On Alerting”. It went somewhat viral and later formed the foundation of a chapter in the famous book Site Reliability Engineering – How Google Runs Production Systems. In parallel, the metrics-based monitoring and alerting system Prometheus was developed at SoundCloud. It is the open-source tool to put Rob’s philosophy into practice. In this talk, I would therefore like to present “applied alerting philosophy” and explain how we use Prometheus at SoundCloud to create meaningful and actionable alerts. In particular, SoundCloud follows a fairly radical “you build it – you run it” approach, which requires additional care to route alerts to the right group of engineers. Prometheus’s “label everything” mantra proves to be very helpful here.
7. My Philosophy on Alerting
based on my observations while I was a Site Reliability Engineer at Google
Author: Rob Ewaschuk <rob@infinitepigeons.org>
Introduction
Vernacular
Monitor for your users
Cause-based alerts are bad (but sometimes necessary)
Alerting from the spout (or beyond!)
Causes are still useful
Tickets, Reports and Email
Playbooks
Tracking & Accountability
You're being naïve!
Summary
Summary
When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:
goo.gl/2vrpSO
14. “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
Chapter 6: Monitoring Distributed Systems
Symptoms vs. causes
Source: Betsy Beyer et al. “Site Reliability Engineering – How Google Runs Production Systems”
15. Pages vs. tickets
Expected response | SRE book | SoundCloud lingo | Delivered to | Based on
Act immediately | Alerts | Pages, severity="critical" | Pager | Symptoms
Act eventually | Tickets | Tickets / “email alerts”, severity="warning" | Issue tracker / Chat / email :-( | Symptoms or causes
None (for diagnostics only) | Logs | Informational alerts, severity="info" | Nowhere / dashboards | Causes
“Alerts” according to Prometheus: everything in the SoundCloud lingo column (pages, tickets, and informational alerts).
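The severity labels on this slide suggest how delivery could be wired up in Alertmanager. The following is only a minimal sketch under stated assumptions, not SoundCloud's actual configuration: the receiver names, the e-mail address, and the PagerDuty key placeholder are all hypothetical, and the e-mail receiver additionally assumes SMTP settings in a `global` section that is omitted here.

```yaml
# Hypothetical Alertmanager routing tree keyed on the severity label.
route:
  receiver: dashboard-only            # default: severity="info" is delivered nowhere
  routes:
    - matchers:
        - severity="critical"         # act immediately: page someone
      receiver: pager
    - matchers:
        - severity="warning"          # act eventually: ticket or e-mail
      receiver: ticket-queue

receivers:
  - name: dashboard-only              # no integrations, so no notification is sent
  - name: pager
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder, not a real key
  - name: ticket-queue
    email_configs:
      - to: team-tickets@example.org               # placeholder address
```

Informational alerts still fire in Prometheus and can be shown on dashboards; they just never notify anyone.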
16. What also counts as “symptoms”…
“One person’s symptom is another person’s cause.”
“Not-yet-occurring but imminent problems.”
“Zero-redundancy (N + 0) situations count as imminent, as do ‘nearly full’ parts of your service!”
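An alert on a “nearly full” part of a service could, for example, be written with PromQL's `predict_linear` function. This is a sketch under assumptions rather than a rule from the talk: it presumes the node_exporter metric `node_filesystem_avail_bytes` is being scraped, and the group name, alert name, time windows, and severity are illustrative choices.

```yaml
groups:
  - name: imminent-problems-example       # hypothetical group name
    rules:
      - alert: FilesystemFillingUp        # hypothetical alert name
        # Extrapolate the last 6h of free space linearly; fire if the
        # prediction for 24h from now is below zero bytes.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600) < 0
        for: 30m                          # assumed grace period against short blips
        labels:
          severity: warning               # assumption: imminent, but not yet user-visible
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill up within 24 hours."
```

Whether such a rule should page or only create a ticket depends on how fast the resource fills up and how long a fix takes.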
18. Probing with real user traffic in multi-tiered services
[Diagram: user traffic → frontend service (instrumented) → backend service A and backend service B. The frontend measures A’s and B’s latency, rps, errors… and alerts the owners of A or B.]
22.
# Ratio of 5xx responses to all HTTP responses, per backend, over the last 5 minutes.
- record: backend:http_errors_per_response:ratio_rate5m
  expr: |2
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann", code="5xx"}[5m]
      ))
    /
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann"}[5m]
      ))
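A symptom-based alert could then be built on top of this recorded ratio. The rule below is a minimal sketch, not taken from the slides: the group name, alert name, 1% threshold, and `for` duration are assumptions chosen for illustration.

```yaml
groups:
  - name: backend-error-ratio-example      # hypothetical group name
    rules:
      - alert: BackendHighErrorRatio       # hypothetical alert name
        # Uses the recorded series; the 1% threshold is an assumed value.
        expr: backend:http_errors_per_response:ratio_rate5m > 0.01
        for: 5m                            # assumed: ignore very short error spikes
        labels:
          severity: critical               # symptom visible to users, so it pages
        annotations:
          summary: "Backend {{ $labels.backend }} answers more than 1% of requests with 5xx."
```

Because the recording rule aggregates with `sum by (backend)`, every resulting alert carries a `backend` label, which is what makes it possible to route the notification to the owners of that particular backend.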