14. When you start monitoring a new system, it's important to
understand what the key metrics are.
Ask yourself basic questions to identify them:
- Is the system up and responsive?
- How much more traffic / queries can it sustain?
#1 - Identify your golden signals
15. The RED method by Tom Wilkie:
#1 - Identify your golden signals
- Request rate: requests / sec (traffic)
- Error rate: errors % (failures)
- Duration: response time (performance)
16. The RED method by Tom Wilkie:
#1 - Identify your golden signals
Then add saturation monitoring:
i.e. CPU, memory, I/O, ...
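Saturation is not part of RED itself; as a sketch, CPU saturation can be derived from node_exporter metrics (the metric name below assumes node_exporter 0.16+, adjust for older versions):

```
# Per-instance CPU usage in percent: 100 minus the average idle-time rate
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```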
17. - Discover targets
- Scrape metrics
- Store metrics
- Query language
- Evaluate alerting rules
A quick recap about the architecture
#2 - Monitor your golden signals
[Diagram: Prometheus pulls metrics from the app and node targets, and pushes alerts to the Alertmanager.]
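The targets in the recap above are declared in Prometheus's configuration; a minimal prometheus.yml sketch (Prometheus 2.x style; job names and addresses are illustrative):

```yaml
global:
  scrape_interval: 15s        # How often targets are scraped

scrape_configs:
  - job_name: 'apps'          # The application targets
    static_configs:
      - targets: ['app1:9090', 'app2:9090']
  - job_name: 'node'          # The node exporters
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

alerting:
  alertmanagers:              # Where evaluated alerts are pushed
    - static_configs:
        - targets: ['alertmanager:9093']
```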
18. Request rate = requests / sec
#2 - Monitor your golden signals
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="viewUser",status="200"} 80
- Counter incremented on each request received
- By method, handler and response status code
19. Request rate = requests / sec
#2 - Monitor your golden signals
sum(rate(http_requests_total[1m]))
{} 7.02
Group by method:
sum(rate(http_requests_total[1m])) by (method)
{method="GET"} 6.10
{method="POST"} 0.92
20. Error rate = total number of errors / total requests
#2 - Monitor your golden signals
sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) /
sum(rate(http_requests_total[1m]))
{} 0.02 = 2%
21. Error rate by method and handler
#2 - Monitor your golden signals
sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) by (method, handler) /
sum(rate(http_requests_total[1m])) by (method, handler)
{method="GET",handler="viewUser"} 0.015
{method="POST",handler="editUser"} 0.005
22. #2 - Monitor your golden signals
Average response time = sum of response times / number of requests
# TYPE http_requests_duration_seconds counter
http_requests_duration_seconds{method="GET",handler="viewUser",status="200"} 4.5
- Sum of all the response times
- By method, handler and response status code
23. #2 - Monitor your golden signals
Average response time = sum of response times / number of requests
sum(increase(http_requests_duration_seconds[1m])) /
sum(increase(http_requests_total[1m]))
{} 0.075 = 75 ms
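As with the error rate, the average response time can be broken down per method and handler; a sketch following the same pattern:

```
sum(increase(http_requests_duration_seconds[1m])) by (method, handler)
  / sum(increase(http_requests_total[1m])) by (method, handler)
```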
24. - Receives alerts as input
- Routes alerts to receivers
We use email, Slack, OpsGenie… but it supports many more
#3 - Alert on golden signals
A quick recap about the architecture
[Diagram: the same architecture; the Alertmanager receives alerts pushed by Prometheus.]
25. Alert on high error rate:
- Use % threshold
- Prefer without() over by() to keep observability high
#3 - Alert on golden signals
ALERT HIGH_ERROR_RATE
IF sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) without (status) /
sum(rate(http_requests_total[1m])) without(status)
> 0.01
FOR 5m
Prometheus v. 1 syntax
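Prometheus 2.x replaced this DSL with YAML rule files; a sketch of the same alert in the 2.x format (group name and file layout are illustrative):

```yaml
groups:
  - name: golden-signals
    rules:
      - alert: HIGH_ERROR_RATE
        expr: |
          sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) without (status)
            / sum(rate(http_requests_total[1m])) without (status)
          > 0.01
        for: 5m
```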
26. Alert on high response times:
- Use absolute value
- Prefer without() over by() to keep observability high
ALERT HIGH_RESPONSE_TIMES
IF sum(increase(http_requests_duration_seconds[1m])) without (status) /
sum(increase(http_requests_total[1m])) without(status)
> 0.5
FOR 5m
#3 - Alert on golden signals
Prometheus v. 1 syntax
27. #4 - Dead targets
A quick recap about the architecture
[Diagram: a custom exporter runs SQL queries against PostgreSQL; Prometheus pulls metrics from the exporter and a node, and pushes alerts to the Alertmanager.]
28. #4 - Dead targets
# TYPE postgres_up gauge
postgres_up{} 1
The exporter exports
ALERT POSTGRESQL_IS_DOWN
IF postgres_up == 0
FOR 5m
And we alert on it
29. #4 - Dead targets
What if Prometheus can't scrape metrics from a target?
30. #4 - Dead targets
If the exporter is down, Prometheus has nothing to scrape: the postgres_up series simply
disappears (it is never exposed as 0), so our previous alert will never fire
ALERT POSTGRESQL_IS_DOWN
IF postgres_up == 0 or absent(postgres_up)
FOR 5m
We can improve the alert with absent()
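In the Prometheus 2.x rule-file format, the improved alert might look like this (sketch):

```yaml
- alert: POSTGRESQL_IS_DOWN
  expr: postgres_up == 0 or absent(postgres_up)
  for: 5m
```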
31. ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF … > 0.01
LABELS {
team="frontend",
severity="warning"
}
Use labels to define the alert's team and severity
#5 - Route alerts by team and severity
Prometheus v. 1 syntax
32. #5 - Route alerts by team and severity
We support three levels of severity:
- warning: Slack (next business day)
- error: Slack + Email (daylight, weekend included)
- critical: Slack + Email + SMS / Phone call (immediately)
33. #5 - Route alerts by team and severity
Use child routes to route by team first, then severity:

route:
  routes:
    # Team specific alerts
    - match:                 # Route by team first
        team: frontend
      routes:                # If the team matches, enter the child routes
        - match_re:
            severity: critical
          receiver: page-frontend-team-by-opsgenie  # Send critical via opsgenie
          continue: true
        - match_re:
            severity: critical|error
          receiver: page-frontend-team-by-email     # Send critical and error via email
          continue: true
        - receiver: page-frontend-team-by-slack     # Always send via slack
          continue: false                           # If the team matched, stop here
34. #5 - Route alerts by team and severity
If no team matches, fall back to the ops team:

route:
  routes:
    # Team specific alerts
    # ...
    # Fallback to ops team
    - match_re:
        severity: critical
      receiver: page-ops-team-by-opsgenie  # Send critical via opsgenie
      continue: true
    - match_re:
        severity: error|critical
      receiver: page-ops-team-by-email     # Send critical and error via email
      continue: true
    - receiver: page-ops-team-by-slack     # Always send via slack
Document manual operations in an easy-to-read playbook,
and link it to the alert using ANNOTATIONS
#6 - Associate playbooks to alerts
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF … > 0.01
LABELS { team="frontend", severity="warning" }
ANNOTATIONS {
playbook="https://doc.spreaker.com/playbooks/high-error-rate-on-frontend"
}
Prometheus v. 1 syntax
36. Customize the alert messages, displaying the playbook too.
#6 - Associate playbooks to alerts
37. Both labels and annotations allow you to attach metadata to your alerts.
#7 - Labels and Annotations
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF … > 0.01
LABELS {
team="frontend",
severity="warning"
}
ANNOTATIONS {
playbook="https://..."
}
Prometheus v. 1 syntax
LABELS
- Information to identify an alert
- Read by a machine
ANNOTATIONS
- Extra information for the receiver
(i.e. a description)
- Read by a human
38. To recap
1. Keep it simple
2. Focus on metrics that bring you value
3. Ensure each alert is actionable
4. Write playbooks for manual intervention
5. Do not alert at all if you can automate the resolution