14. When you start monitoring a new system, it's important to
understand what the key metrics are.
Ask yourself basic questions to identify them:
- Is the system up and responsive?
- How much more traffic / queries can it sustain?
#1 - Identify your golden signals
15. The RED method by Tom Wilkie:
#1 - Identify your golden signals
- Request rate: requests / sec (traffic)
- Error rate: errors % (failures)
- Duration: response time (performance)
16. The RED method by Tom Wilkie:
#1 - Identify your golden signals
Then add saturation monitoring:
i.e. CPU, memory, I/O, ...
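Saturation is not part of RED itself; as a sketch, CPU saturation can be derived from node_exporter metrics (the metric name below assumes node_exporter 0.16+, adjust for older versions):

```
# Per-instance CPU usage in percent: 100 minus the average idle-time rate
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```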
17. - Discover targets
- Scrape metrics
- Store metrics
- Query language
- Evaluate alerting rules
A quick recap about the architecture
#2 - Monitor your golden signals
[Diagram: Prometheus pulls metrics from the app and node targets, and pushes alerts to the Alertmanager.]
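The targets in the recap above are declared in Prometheus's configuration; a minimal prometheus.yml sketch (Prometheus 2.x style; job names and addresses are illustrative):

```yaml
global:
  scrape_interval: 15s        # How often targets are scraped

scrape_configs:
  - job_name: 'apps'          # The application targets
    static_configs:
      - targets: ['app1:9090', 'app2:9090']
  - job_name: 'node'          # The node exporters
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

alerting:
  alertmanagers:              # Where evaluated alerts are pushed
    - static_configs:
        - targets: ['alertmanager:9093']
```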
18. Request rate = requests / sec
#2 - Monitor your golden signals
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="viewUser",status="200"} 80
- Counter incremented on each request received
- By method, handler and response status code
19. Request rate = requests / sec
#2 - Monitor your golden signals
sum(rate(http_requests_total[1m]))
{} 7.02
Group by method:
sum(rate(http_requests_total[1m])) by (method)
{method="GET"} 6.10
{method="POST"} 0.92
20. Error rate = total number of errors / total requests
#2 - Monitor your golden signals
sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) /
sum(rate(http_requests_total[1m]))
{} 0.02 = 2%
21. Error rate by method and handler
#2 - Monitor your golden signals
sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) by (method, handler) /
sum(rate(http_requests_total[1m])) by (method, handler)
{method="GET",handler="viewUser"} 0.015
{method="POST",handler="editUser"} 0.005
22. #2 - Monitor your golden signals
Average response time = sum of response times / number of requests
# TYPE http_requests_duration_seconds counter
http_requests_duration_seconds{method="GET",handler="viewUser",status="200"} 4.5
- Sum of all the response times
- By method, handler and response status code
23. #2 - Monitor your golden signals
Average response time = sum of response times / number of requests
sum(increase(http_requests_duration_seconds[1m])) /
sum(increase(http_requests_total[1m]))
{} 0.075 = 75 ms
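As with the error rate, the average response time can be broken down per method and handler; a sketch following the same pattern:

```
sum(increase(http_requests_duration_seconds[1m])) by (method, handler)
  / sum(increase(http_requests_total[1m])) by (method, handler)
```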
24. - Receives alerts as input
- Routes alerts to receivers
We use email, Slack, OpsGenie… but it supports many more
#3 - Alert on golden signals
A quick recap about the architecture
[Diagram: the same architecture; the Alertmanager receives alerts pushed by Prometheus.]
25. Alert on high error rate:
- Use % threshold
- Prefer without() over by() to keep observability high
#3 - Alert on golden signals
ALERT HIGH_ERROR_RATE
IF sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) without (status) /
sum(rate(http_requests_total[1m])) without(status)
> 0.01
FOR 5m
Prometheus v. 1 syntax
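Prometheus 2.x replaced this DSL with YAML rule files; a sketch of the same alert in the 2.x format (group name and file layout are illustrative):

```yaml
groups:
  - name: golden-signals
    rules:
      - alert: HIGH_ERROR_RATE
        expr: |
          sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) without (status)
            / sum(rate(http_requests_total[1m])) without (status)
          > 0.01
        for: 5m
```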
26. Alert on high response times:
- Use absolute value
- Prefer without() over by() to keep observability high
ALERT HIGH_RESPONSE_TIMES
IF sum(increase(http_requests_duration_seconds[1m])) without (status) /
sum(increase(http_requests_total[1m])) without(status)
> 0.5
FOR 5m
#3 - Alert on golden signals
Prometheus v. 1 syntax
27. #4 - Dead targets
A quick recap about the architecture
[Diagram: a custom exporter runs SQL queries against PostgreSQL; Prometheus pulls metrics from the exporter and a node, and pushes alerts to the Alertmanager.]
28. #4 - Dead targets
# TYPE postgres_up gauge
postgres_up{} 1
The exporter exports
ALERT POSTGRESQL_IS_DOWN
IF postgres_up == 0
FOR 5m
And we alert on it
29. #4 - Dead targets
What if Prometheus can't scrape metrics from a target?
30. #4 - Dead targets
If the exporter is down, Prometheus has nothing to scrape: the postgres_up series simply
disappears (it is never exposed as 0), so our previous alert will never fire
ALERT POSTGRESQL_IS_DOWN
IF postgres_up == 0 or absent(postgres_up)
FOR 5m
We can improve the alert with absent()
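In the Prometheus 2.x rule-file format, the improved alert might look like this (sketch):

```yaml
- alert: POSTGRESQL_IS_DOWN
  expr: postgres_up == 0 or absent(postgres_up)
  for: 5m
```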
31. ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF … > 0.01
LABELS {
team="frontend",
severity="warning"
}
Use labels to define the alert's team and severity
#5 - Route alerts by team and severity
Prometheus v. 1 syntax
32. #5 - Route alerts by team and severity
We support three levels of severity:
- warning: Slack (next business day)
- error: Slack + Email (daylight, weekend included)
- critical: Slack + Email + SMS / Phone call (immediately)
33. #5 - Route alerts by team and severity
Use child routes to route by team first, then severity:

route:
  routes:
    # Team specific alerts
    - match:                 # Route by team first
        team: frontend
      routes:                # If the team matches, enter the child routes
        - match_re:
            severity: critical
          receiver: page-frontend-team-by-opsgenie  # Send critical via opsgenie
          continue: true
        - match_re:
            severity: critical|error
          receiver: page-frontend-team-by-email     # Send critical and error via email
          continue: true
        - receiver: page-frontend-team-by-slack     # Always send via slack
          continue: false                           # If the team matched, stop here
34. #5 - Route alerts by team and severity
If no team matches, fall back to the ops team:

route:
  routes:
    # Team specific alerts
    # ...
    # Fallback to ops team
    - match_re:
        severity: critical
      receiver: page-ops-team-by-opsgenie  # Send critical via opsgenie
      continue: true
    - match_re:
        severity: error|critical
      receiver: page-ops-team-by-email     # Send critical and error via email
      continue: true
    - receiver: page-ops-team-by-slack     # Always send via slack
Document manual operations in an easy-to-read playbook,
and link it to the alert using ANNOTATIONS
#6 - Associate playbooks to alerts
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF … > 0.01
LABELS { team="frontend", severity="warning" }
ANNOTATIONS {
playbook="https://doc.spreaker.com/playbooks/high-error-rate-on-frontend"
}
Prometheus v. 1 syntax
36. Customize the alert messages, displaying the playbook too.
#6 - Associate playbooks to alerts
37. Both labels and annotations allow you to attach metadata to your alerts.
#7 - Labels and Annotations
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF … > 0.01
LABELS {
team="frontend",
severity="warning"
}
ANNOTATIONS {
playbook="https://..."
}
Prometheus v. 1 syntax
LABELS
- Information to identify an alert
- Read by a machine
ANNOTATIONS
- Extra information for the receiver
(i.e. a description)
- Read by a human
38. To recap
1. Keep it simple
2. Focus on metrics that bring you value
3. Ensure each alert is actionable
4. Write playbooks for manual intervention
5. Do not alert at all if you can automate the resolution