SlideShare ist ein Scribd-Unternehmen logo
1 von 74
Downloaden Sie, um offline zu lesen
PromCon Munich
Julien Pivotto
@roidelapluie
Improved alerting with
Prometheus and
Alertmanager
November 8th, 2019
• This talk contains PromQL.
• This talk contains YAML.
• What you will see was built over time.
Important notes!
@roidelapluie
Context
@roidelapluie
Message Broker in the Belgian healthcare sector
• High visibility
• Sync & Async
• Legacy & New
• Lots of partners
• Multiple customers
Message Broker
@roidelapluie
Technical
Business
Monitoring
@roidelapluie Font Awesome CC-BY-4.0
Alerts are not only for incidents.
Some alerts carry business information about ongoing
events (good or bad).
Some alerts go outside of our org.
Some alerts are not for humans.
Alerting
@roidelapluie
Channels
@roidelapluie Font Awesome CC-BY-4.0
Repeat every 15m, 1h, 4h, 24h, 2d
24x7, 10x5, 12x6, 10x7, never
Legal holidays
Time frames
@roidelapluie
Updated annotations & value
Updated graphs
15m/1h repeat interval?
@roidelapluie
• Alertmanager owns the noti cations
• Webhook receivers have no logic
• Take decisions at time of alert writing
Constraints
@roidelapluie
• Avoid Alertmanager recon gurations
• Safe and easy way to write alerts
• Only send relevant alerts
• Alert on staging environments
Challenges
@roidelapluie
PromQL
@roidelapluie
- alert: a target is down
expr: up == 0
for: 5m
Gauges
@roidelapluie
Gauges
@roidelapluie
Instead of:
- alert: a target is down
expr: up == 0
for: 5m
Do:
- alert: a target is down
expr: avg_over_time(up[5m]) < .9
for: 5m
Gauges
@roidelapluie
Alert me if temperature is above 27°C
Hysteresis
@roidelapluie
- alert: temperature is above threshold
expr: temperature_celcius > 27
for: 5m
labels:
priority: high
Hysteresis
@roidelapluie
Hysteresis
@roidelapluie
Hysteresis is the dependence of the state of a system
on its history.
Hysteresis
@roidelapluie Wikipedia CC-BY-SA-3.0
- alert: temperature is above threshold
expr: |
avg_over_time(temperature_celcius[5m])
> 27
for: 5m
labels:
priority: high
alternative: max_over_time
5m might be too short
if > 5m: when is it resolved?
Hysteresis
@roidelapluie
Alert me
• if temperature is above 27°C
• only stop when it gets below 25°C
Hysteresis
@roidelapluie
(avg_over_time(temperature_celcius[5m]) > 27)
or (temperature_celcius > 25 and
count without (alertstate, alertname, priority)
ALERTS{
alertstate="firing",
alertname="temperature is above threshold"
})
Hysteresis
@roidelapluie
temperature_celcius > 27
but...
Computed threshold
@roidelapluie
- record: temperature_threshold_celcius
expr: |
27+0*temperature_celcius{
location=~".*ambiant"
}
or 25+0*temperature_celcius
Bonus: temperature_threshold_celcius can be
used in grafana!
Computed threshold
@roidelapluie
- alert: temperature is above threshold
expr: |
temperature_celcius >
temperature_threshold_celcius
Note: put threshold & alert
in the same alert group
Computed threshold
@roidelapluie
- alert: no more sms
expr: sms_available < 39000
Absence
@roidelapluie
Absence
@roidelapluie
No metric = No alert!
Metric is back = New alert!
Absence
@roidelapluie
- record: sms_available_last
expr: |
sms_available or
sms_available_last
- alert: no more sms
record: sms_available_last < 39000
- alert: no more sms data
record: absent(sms_available)
for: 1h
Absence
@roidelapluie
Con guration
@roidelapluie
recipients:
name/channel
jpivotto/mail
opsteam/ticket
appteam/message
customer/sms
dc1/jenkins
Recipients
@roidelapluie
Alertmanager receivers
- name: "opsteam/mail"
email_configs:
- to: 'ops@inuits.eu'
send_resolved: yes
html: "{{ template "inuits.html.tmpl" . }}"
text: "{{ template "inuits.txt.tmpl" . }}"
headers:
Subject: "{{ template "title.tmpl" . }}"
Hint: Subject can be a template.
Receivers
@roidelapluie
Alertmanager receivers
- name: "opsteam/mail/noresolved"
email_configs:
- to: 'ops@inuits.eu'
send_resolved: no
html: "{{ template "inuits.html.tmpl" . }}"
text: "{{ template "inuits.txt.tmpl" . }}"
headers:
Subject: "{{ template "title.tmpl" . }}"
Same, but with send_resolved: no
Receivers
@roidelapluie
Alertmanager receivers
- name: "lotsOfPeople/mail"
email_configs:
- to: 'a@inuits.eu,b@inuits.eu,c@inuits.eu'
headers:
To: a@inuits.eu
CC: b@inuits.eu
Reply-To: support@inuits.eu
c@inuits.eu is now BCC.
Email: CC, BCC
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
annotations:
summary: ...
resolved_summary: ...
Who gets the alert?
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes: [...]
- receiver: "opsteam/ticket"
match_re:
recipient: "(.*,)?opsteam/ticket(,.*)?"
continue: true
routes: [...]
Who gets the alert?
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
send_resolved: "no"
Resolved
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes:
- receiver: customer1/sms/noresolved
match:
send_resolved: "no"
Resolved
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
repeat_interval: 1h
Repeat interval
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes:
- receiver: customer1/sms
repeat_interval: 1h
match:
repeat_interval: 1h
Repeat interval
@roidelapluie
Some channels have speci c group_interval: 0s.
Some channels always send_resolved: no.
Some recipients have aliases (ticket+chat).
Extra con gurations
@roidelapluie
Extract of amtool con g routes show
─ {recipient=~"^(?:(.*,)?jpivotto/mail(,.*)?)$"} continue: true receiver: jpivotto/mail
   ├── {repeat_interval="15m",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="15m"} receiver: jpivotto/mail
   ├── {repeat_interval="30m",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="30m"} receiver: jpivotto/mail
   ├── {repeat_interval="1h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="1h"} receiver: jpivotto/mail
   ├── {repeat_interval="2h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="2h"} receiver: jpivotto/mail
   ├── {repeat_interval="4h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="4h"} receiver: jpivotto/mail
   ├── {repeat_interval="6h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="6h"} receiver: jpivotto/mail
   ├── {repeat_interval="12h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="12h"} receiver: jpivotto/mail
   ├── {repeat_interval="24h",send_resolved="no"} receiver:jpivotto/mail/noresolved
   ├── {repeat_interval="24h"} receiver: jpivotto/mail
   ├── {send_resolved="no"} receiver: jpivotto/mail/noresolved
   └── {repeat_interval=""} receiver: jpivotto/mail
Routes tree
@roidelapluie
Con g Management!
Our input
receivers:
customer:
email:
to: [customer@example.com]
cc: [service-management@inuits.eu]
bcc: [ops@inuits.eu]
sms: [+1234567890, +2345678901]
chat:
room: "#customer"
How do we achieve it?
@roidelapluie
• Script that is deployed with AM
• Knows all the recipients
• Will validate alerts yaml
• promtool
• mandatory labels
• validate receivers label
• validate repeat_interval label
Not possible to write alerts that go nowhere by
accident.
Con guration management
@roidelapluie
Time frame
@roidelapluie
Prometheus alert
- alert: a target is down
expr: up == 0
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
time_window: 13x5
Time frame
@roidelapluie
- record: daily_saving_time_belgium
expr: |
(vector(0) and (month() < 3 or month() > 10))
or
(vector(1) and (month() > 3 and month() < 10))
or
(
(
(month() %2 and (day_of_month() - day_of_week()
> (30 + +month() % 2 - 7)) and day_of_week() > 0)
or
-1*month()%2+1 and (day_of_month() -
day_of_week() <= (30 + month() % 2 - 7))
)
)
or
(vector(1) and ((month()==10 and hour() < 1) or (month()==3 and hour() > 0
or
vector(0)
- record: belgium_localtime
expr: |
time() + 3600 + 3600 * daily_saving_time_belgium
Timezone
@roidelapluie
hour(belgium_localtime)
hour() and other time-functions can take a
timestamp as argument.
Belgian hour
@roidelapluie
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime) == 25
and month(belgium_localtime) == 12
labels:
name: Xmas
Holidays
@roidelapluie
groups:
- name: Easter Meeus/Jones/Butcher Algorithm
interval: 60s
rules:
- record: easter_y
expr: year(belgium_localtime)
- record: easter_a
expr: easter_y % 19
- record: easter_b
expr: floor(easter_y / 100)
- record: easter_c
expr: easter_y % 100
- record: easter_d
expr: floor(easter_b / 4)
- record: easter_e
expr: easter_b % 4
- record: easter_f
expr: floor((easter_b +8 ) / 25)
- record: easter_g
expr: floor((easter_b - easter_f + 1 ) / 3)
- record: easter_h
expr: (19*easter_a + easter_b - easter_d - easter_g + 15 ) % 30
- record: easter_i
expr: floor(easter_c/4)
- record: easter_k
expr: easter_c%4
- record: easter_l
expr: (32 + 2*easter_e + 2*easter_i - easter_h - easter_k) % 7
- record: easter_m
expr: floor((easter_a + 11*easter_h + 22*easter_l) / 451)
- record: easter_month
expr: floor((easter_h + easter_l - 7*easter_m + 114) / 31)
- record: easter_day
expr: ((easter_h + easter_l - 7*easter_m + 114) %31) + 1
Easter
@roidelapluie Wikipedia CC-BY-SA-3.0
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime-86400) == easter_day
and month(belgium_localtime-86400) == easter_month
labels:
name: Easter Monday
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime-40*86400) == easter_day
and month(belgium_localtime-40*86400) == easter_month
labels:
name: Feast of the Ascension
Easter
@roidelapluie
- record: business_day
expr: |
vector(1) and day_of_week(belgium_localtime) > 0
and day_of_week(belgium_localtime) < 6
unless count(public_holiday)
- record: belgium_hour
expr: |
hour(belgium_localtime)
- record: business_hour
expr: |
vector(1) and belgium_hour >= 8 < 18
and business_day
Business hour
@roidelapluie
- record: extended_business_hour
expr: |
(vector(1) and belgium_hour >= 7 < 20
and business_day)
- record: extended_business_hour_sat
expr: |
extended_business_hour
or (vector(1) and belgium_hour >= 7 < 14
and day_of_week(belgium_localtime) == 6
unless count(public_holiday))
Extended business hours
@roidelapluie
(sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost)
> 10 and on () business_hour)
or
(sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 1
and sum(rate(http_requests_total{code=~"2.."}[5m])) by (vhost) < 1)
and on () business_hour
Thresholds depending on time
@roidelapluie
- record: daylight
expr: |
vector(1) and belgium_hour >= 8 < 18
- record: extended_daylight
expr: |
vector(1) and belgium_hour >= 7 < 20
Day and night
@roidelapluie
- alert: Time Window - Night
expr: absent(daylight)
labels:
recipient: none
- alert: Time Window - OBH
expr: absent(business_hour)
labels:
recipient: none
- alert: Time Window - Extended Night
expr: absent(extended_daylight)
labels:
recipient: none
Alerts
@roidelapluie
- alert: Time Window - Extended OBH with Saturday
expr: absent(extended_business_hour_sat)
labels:
recipient: none
- alert: Time Window - Extended OBH
expr: absent(extended_business_hour)
labels:
recipient: none
Alerts
@roidelapluie
At this point, we will have "meaningless" alerts at night
and during business holidays.
Alerts
@roidelapluie
Inhibition is a concept of suppressing noti cations for
certain alerts if certain other alerts are already ring.
Inhibition
@roidelapluie Alertmanager documentation CC-BY-4.0
Alertmanager inhibition
inhibit_rules:
- source_match:
alertname: "Time Window - Night"
target_match:
time_window: 10x7
- source_match:
alertname: "Time Window - OBH"
target_match:
time_window: 10x5
Inhibition
@roidelapluie
Alertmanager inhibition
- source_match:
alertname: "Time Window - Extended Night"
target_match:
time_window: 13x7
- source_match:
alertname: "Time Window - Extended OBH"
target_match:
time_window: 13x5
- source_match:
alertname: "Time Window - Extended OBH with Saturday"
target_match:
time_window: 13x6
Inhibition
@roidelapluie
Alerts Relabeling
@roidelapluie
Prometheus alert
- alert: a target is down
expr: up == 0
for: 5m
labels:
recipients_prod: customer1/sms,opsteam/ticket
time_window_prod: 24x7
recipients: opsteam/chat
time_window: 8x5
Per env recipients
@roidelapluie
Prometheus con g
alerting:
alert_relabel_configs:
- source_labels: [time_window_prod,env]
regex: "(.+);prod"
target_label: time_window
replacement: '$1'
- source_labels: [time_window_dev,env]
regex: "(.+);dev"
target_label: time_window
replacement: '$1'
Repeat for other env, other labels.
Per env recipients
@roidelapluie
Prometheus con g
alerting:
alert_relabel_configs:
- source_labels: [time_window]
regex: "never"
action: drop
Be careful about the order (time_window can be
mutated by relabeling).
Drop alert
@roidelapluie
Conclusion
@roidelapluie
• Alert-Writing experience is great with this
• Prometheus and PromQL can do a lot
• Con g management lls the "gaps"
• With some e ort, we have everything we wanted
Conclusion
@roidelapluie
- name: normal rate
interval: 120s
rules:
- record: request_rate_history
expr: |
sum(rate(http_requests_total[5m])) by (env)
labels:
when: 0w
Bonus
@roidelapluie
- record: request_rate_history
expr: |
(
sum(rate(http_requests_total[5m] offset 168h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
== daily_saving_time_belgium)
) or
(
sum(rate(http_requests_total[5m] offset 167h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
< daily_saving_time_belgium)
) or
(
sum(rate(http_requests_total[5m] offset 169h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
> daily_saving_time_belgium)
)
labels:
when: 1w
Bonus
@roidelapluie
- record: request_rate_normal
expr: |
max(bottomk(1,
topk(4, request_rate_history) by(env)
) by(env)) by(env)
Bonus
@roidelapluie
Bonus
@roidelapluie
Bonus
@roidelapluie
Really hope we will get rid of DST in 2021!
Bonus
@roidelapluie
Julien Pivotto
@roidelapluie
roidelapluie@inuits.eu
Essensteenweg 31
2930 Brasschaat
Belgium
Contact:
info@inuits.eu
+32-3-8082105

Weitere ähnliche Inhalte

Was ist angesagt?

Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialTim Vaillancourt
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
Understanding and Extending Prometheus AlertManager
Understanding and Extending Prometheus AlertManagerUnderstanding and Extending Prometheus AlertManager
Understanding and Extending Prometheus AlertManagerLee Calcote
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?Wojciech Barczyński
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenParis Container Day
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
 
Best Practices of Infrastructure as Code with Terraform
Best Practices of Infrastructure as Code with TerraformBest Practices of Infrastructure as Code with Terraform
Best Practices of Infrastructure as Code with TerraformDevOps.com
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Monitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusMonitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusChandresh Pancholi
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basicsJuraj Hantak
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)Lucas Jellema
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
 
Terraform: An Overview & Introduction
Terraform: An Overview & IntroductionTerraform: An Overview & Introduction
Terraform: An Overview & IntroductionLee Trout
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 

Was ist angesagt? (20)

Monitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_TutorialMonitoring_with_Prometheus_Grafana_Tutorial
Monitoring_with_Prometheus_Grafana_Tutorial
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Understanding and Extending Prometheus AlertManager
Understanding and Extending Prometheus AlertManagerUnderstanding and Extending Prometheus AlertManager
Understanding and Extending Prometheus AlertManager
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Best Practices of Infrastructure as Code with Terraform
Best Practices of Infrastructure as Code with TerraformBest Practices of Infrastructure as Code with Terraform
Best Practices of Infrastructure as Code with Terraform
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Prometheus 101
Prometheus 101Prometheus 101
Prometheus 101
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Monitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusMonitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheus
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
 
Terraform: An Overview & Introduction
Terraform: An Overview & IntroductionTerraform: An Overview & Introduction
Terraform: An Overview & Introduction
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 

Ähnlich wie Improved alerting with Prometheus and Alertmanager

#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docxAASTHA76
 
AI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroAI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroGianluca Carucci
 
Accelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsAccelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsBorja López
 
Proposed pricing model for cloud computing
Proposed pricing model for cloud computingProposed pricing model for cloud computing
Proposed pricing model for cloud computingAdeel Javaid
 
Ch02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsCh02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsJames Brotsos
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptMahyuddin8
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptghoitsun
 
Objects? No thanks!
Objects? No thanks!Objects? No thanks!
Objects? No thanks!corehard_by
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
Observability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldObservability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldSneha Inguva
 
C++ process new
C++ process newC++ process new
C++ process new敬倫 林
 
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Mozaic Works
 
Control Statement.ppt
Control Statement.pptControl Statement.ppt
Control Statement.pptsanjay
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusBol.com Techlab
 

Ähnlich wie Improved alerting with Prometheus and Alertmanager (20)

#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx
 
AI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroAI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to hero
 
Accelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsAccelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sums
 
Proposed pricing model for cloud computing
Proposed pricing model for cloud computingProposed pricing model for cloud computing
Proposed pricing model for cloud computing
 
Code Tuning
Code TuningCode Tuning
Code Tuning
 
C# Loops
C# LoopsC# Loops
C# Loops
 
Performance
PerformancePerformance
Performance
 
14 queuing
14 queuing14 queuing
14 queuing
 
Ch02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsCh02 primitive-data-definite-loops
Ch02 primitive-data-definite-loops
 
ScalaMeter 2012
ScalaMeter 2012ScalaMeter 2012
ScalaMeter 2012
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 
Objects? No thanks!
Objects? No thanks!Objects? No thanks!
Objects? No thanks!
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
Observability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldObservability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled World
 
Ch3
Ch3Ch3
Ch3
 
C++ process new
C++ process newC++ process new
C++ process new
 
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
 
Control Statement.ppt
Control Statement.pptControl Statement.ppt
Control Statement.ppt
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 

Mehr von Julien Pivotto

What's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemWhat's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemJulien Pivotto
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingJulien Pivotto
 
What's new in Prometheus?
What's new in Prometheus?What's new in Prometheus?
What's new in Prometheus?Julien Pivotto
 
Introduction to Grafana Loki
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana LokiJulien Pivotto
 
Why you should revisit mgmt
Why you should revisit mgmtWhy you should revisit mgmt
Why you should revisit mgmtJulien Pivotto
 
Observing the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusObserving the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusJulien Pivotto
 
Monitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusMonitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusJulien Pivotto
 
5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery5 tips for Prometheus Service Discovery
5 tips for Prometheus Service DiscoveryJulien Pivotto
 
Prometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionPrometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionJulien Pivotto
 
Powerful graphs in Grafana
Powerful graphs in GrafanaPowerful graphs in Grafana
Powerful graphs in GrafanaJulien Pivotto
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress ControllerJulien Pivotto
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with KeycloakJulien Pivotto
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaborationJulien Pivotto
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as CodeJulien Pivotto
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusJulien Pivotto
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusJulien Pivotto
 
An introduction to Ansible
An introduction to AnsibleAn introduction to Ansible
An introduction to AnsibleJulien Pivotto
 

Mehr von Julien Pivotto (20)

The O11y Toolkit
The O11y ToolkitThe O11y Toolkit
The O11y Toolkit
 
What's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemWhat's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its Ecosystem
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
 
What's new in Prometheus?
What's new in Prometheus?What's new in Prometheus?
What's new in Prometheus?
 
Introduction to Grafana Loki
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana Loki
 
Why you should revisit mgmt
Why you should revisit mgmtWhy you should revisit mgmt
Why you should revisit mgmt
 
Observing the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusObserving the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From Prometheus
 
Monitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusMonitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with Prometheus
 
5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery
 
Prometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionPrometheus and TLS - an Introduction
Prometheus and TLS - an Introduction
 
Powerful graphs in Grafana
Powerful graphs in GrafanaPowerful graphs in Grafana
Powerful graphs in Grafana
 
YAML Magic
YAML MagicYAML Magic
YAML Magic
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with Keycloak
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as Code
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with Prometheus
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with Prometheus
 
An introduction to Ansible
An introduction to AnsibleAn introduction to Ansible
An introduction to Ansible
 
Jsonnet
JsonnetJsonnet
Jsonnet
 

Kürzlich hochgeladen

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Improved alerting with Prometheus and Alertmanager

  • 1. PromCon Munich Julien Pivotto @roidelapluie Improved alerting with Prometheus and Alertmanager November 8th, 2019
  • 2. • This talk contains PromQL. • This talk contains YAML. • What you will see was built over time. Important notes! @roidelapluie
  • 4. Message Broker in the Belgian healthcare sector • High visibility • Sync & Async • Legacy & New • Lots of partners • Multiple customers Message Broker @roidelapluie
  • 6. Alerts are not only for incidents. Some alerts carry business information about ongoing events (good or bad). Some alerts go outside of our org. Some alerts are not for humans. Alerting @roidelapluie
  • 8. Repeat every 15m, 1h, 4h, 24h, 2d 24x7, 10x5, 12x6, 10x7, never Legal holidays Time frames @roidelapluie
  • 9. Updated annotations & value Updated graphs 15m/1h repeat interval? @roidelapluie
  • 10. • Alertmanager owns the noti cations • Webhook receivers have no logic • Take decisions at time of alert writing Constraints @roidelapluie
  • 11. • Avoid Alertmanager recon gurations • Safe and easy way to write alerts • Only send relevant alerts • Alert on staging environments Challenges @roidelapluie
  • 13. - alert: a target is down expr: up == 0 for: 5m Gauges @roidelapluie
  • 15. Instead of: - alert: a target is down expr: up == 0 for: 5m Do: - alert: a target is down expr: avg_over_time(up[5m]) < .9 for: 5m Gauges @roidelapluie
  • 16. Alert me if temperature is above 27°C Hysteresis @roidelapluie
  • 17. - alert: temperature is above threshold expr: temperature_celcius > 27 for: 5m labels: priority: high Hysteresis @roidelapluie
  • 19. Hysteresis is the dependence of the state of a system on its history. Hysteresis @roidelapluie Wikipedia CC-BY-SA-3.0
  • 20. - alert: temperature is above threshold expr: | avg_over_time(temperature_celcius[5m]) > 27 for: 5m labels: priority: high alternative: max_over_time 5m might be too short if > 5m: when is it resolved? Hysteresis @roidelapluie
  • 21. Alert me • if temperature is above 27°C • only stop when it gets below 25°C Hysteresis @roidelapluie
  • 22. (avg_over_time(temperature_celcius[5m]) > 27) or (temperature_celcius > 25 and count without (alertstate, alertname, priority) ALERTS{ alertstate="firing", alertname="temperature is above threshold" }) Hysteresis @roidelapluie
  • 23. temperature_celcius > 27 but... Computed threshold @roidelapluie
  • 24. - record: temperature_threshold_celcius expr: | 27+0*temperature_celcius{ location=~".*ambiant" } or 25+0*temperature_celcius Bonus: temperature_threshold_celcius can be used in grafana! Computed threshold @roidelapluie
  • 25. - alert: temperature is above threshold expr: | temperature_celcius > temperature_threshold_celcius Note: put threshold & alert in the same alert group Computed threshold @roidelapluie
  • 26. - alert: no more sms expr: sms_available < 39000 Absence @roidelapluie
  • 28. No metric = No alert! Metric is back = New alert! Absence @roidelapluie
  • 29. - record: sms_available_last expr: | sms_available or sms_available_last - alert: no more sms record: sms_available_last < 39000 - alert: no more sms data record: absent(sms_available) for: 1h Absence @roidelapluie
  • 32. Alertmanager receivers - name: "opsteam/mail" email_configs: - to: 'ops@inuits.eu' send_resolved: yes html: "{{ template "inuits.html.tmpl" . }}" text: "{{ template "inuits.txt.tmpl" . }}" headers: Subject: "{{ template "title.tmpl" . }}" Hint: Subject can be a template. Receivers @roidelapluie
  • 33. Alertmanager receivers - name: "opsteam/mail/noresolved" email_configs: - to: 'ops@inuits.eu' send_resolved: no html: "{{ template "inuits.html.tmpl" . }}" text: "{{ template "inuits.txt.tmpl" . }}" headers: Subject: "{{ template "title.tmpl" . }}" Same, but with send_resolved: no Receivers @roidelapluie
  • 34. Alertmanager receivers - name: "lotsOfPeople/mail" email_configs: - to: 'a@inuits.eu,b@inuits.eu,c@inuits.eu' headers: To: a@inuits.eu CC: b@inuits.eu Reply-To: support@inuits.eu c@inuits.eu is now BCC. Email: CC, BCC @roidelapluie
  • 35. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket annotations: summary: ... resolved_summary: ... Who gets the alert? @roidelapluie
  • 36. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: [...] - receiver: "opsteam/ticket" match_re: recipient: "(.*,)?opsteam/ticket(,.*)?" continue: true routes: [...] Who gets the alert? @roidelapluie
  • 37. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket send_resolved: "no" Resolved @roidelapluie
  • 38. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: - receiver: customer1/sms/noresolved match: send_resolved: "no" Resolved @roidelapluie
  • 39. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket repeat_interval: 1h Repeat interval @roidelapluie
  • 40. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: - receiver: customer1/sms repeat_interval: 1h match: repeat_interval: 1h Repeat interval @roidelapluie
  • 41. Some channels have speci c group_interval: 0s. Some channels always send_resolved: no. Some recipients have aliases (ticket+chat). Extra con gurations @roidelapluie
  • 42. Extract of amtool con g routes show ─ {recipient=~"^(?:(.*,)?jpivotto/mail(,.*)?)$"} continue: true receiver: jpivotto/mail    ├── {repeat_interval="15m",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="15m"} receiver: jpivotto/mail    ├── {repeat_interval="30m",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="30m"} receiver: jpivotto/mail    ├── {repeat_interval="1h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="1h"} receiver: jpivotto/mail    ├── {repeat_interval="2h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="2h"} receiver: jpivotto/mail    ├── {repeat_interval="4h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="4h"} receiver: jpivotto/mail    ├── {repeat_interval="6h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="6h"} receiver: jpivotto/mail    ├── {repeat_interval="12h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="12h"} receiver: jpivotto/mail    ├── {repeat_interval="24h",send_resolved="no"} receiver:jpivotto/mail/noresolved    ├── {repeat_interval="24h"} receiver: jpivotto/mail    ├── {send_resolved="no"} receiver: jpivotto/mail/noresolved    └── {repeat_interval=""} receiver: jpivotto/mail Routes tree @roidelapluie
  • 43. Con g Management! Our input receivers: customer: email: to: [customer@example.com] cc: [service-management@inuits.eu] bcc: [ops@inuits.eu] sms: [+1234567890, +2345678901] chat: room: "#customer" How do we achieve it? @roidelapluie
  • 44. • Script that is deployed with AM • Knows all the recipients • Will validate alerts yaml • promtool • mandatory labels • validate receivers label • validate repeat_interval label Not possible to write alerts that go nowhere by accident. Con guration management @roidelapluie
  • 46. Prometheus alert - alert: a target is down expr: up == 0 for: 5m labels: recipients: customer1/sms,opsteam/ticket time_window: 13x5 Time frame @roidelapluie
  • 47. - record: daily_saving_time_belgium expr: | (vector(0) and (month() < 3 or month() > 10)) or (vector(1) and (month() > 3 and month() < 10)) or ( ( (month() %2 and (day_of_month() - day_of_week() > (30 + +month() % 2 - 7)) and day_of_week() > 0) or -1*month()%2+1 and (day_of_month() - day_of_week() <= (30 + month() % 2 - 7)) ) ) or (vector(1) and ((month()==10 and hour() < 1) or (month()==3 and hour() > 0 or vector(0) - record: belgium_localtime expr: | time() + 3600 + 3600 * daily_saving_time_belgium Timezone @roidelapluie
  • 48. hour(belgium_localtime) hour() and other time-functions can take a timestamp as argument. Belgian hour @roidelapluie
  • 49. - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime) == 25 and month(belgium_localtime) == 12 labels: name: Xmas Holidays @roidelapluie
  • 50. groups: - name: Easter Meeus/Jones/Butcher Algorithm interval: 60s rules: - record: easter_y expr: year(belgium_localtime) - record: easter_a expr: easter_y % 19 - record: easter_b expr: floor(easter_y / 100) - record: easter_c expr: easter_y % 100 - record: easter_d expr: floor(easter_b / 4) - record: easter_e expr: easter_b % 4 - record: easter_f expr: floor((easter_b +8 ) / 25) - record: easter_g expr: floor((easter_b - easter_f + 1 ) / 3) - record: easter_h expr: (19*easter_a + easter_b - easter_d - easter_g + 15 ) % 30 - record: easter_i expr: floor(easter_c/4) - record: easter_k expr: easter_c%4 - record: easter_l expr: (32 + 2*easter_e + 2*easter_i - easter_h - easter_k) % 7 - record: easter_m expr: floor((easter_a + 11*easter_h + 22*easter_l) / 451) - record: easter_month expr: floor((easter_h + easter_l - 7*easter_m + 114) / 31) - record: easter_day expr: ((easter_h + easter_l - 7*easter_m + 114) %31) + 1 Easter @roidelapluie Wikipedia CC-BY-SA-3.0
  • 51. - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime-86400) == easter_day and month(belgium_localtime-86400) == easter_month labels: name: Easter Monday - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime-40*86400) == easter_day and month(belgium_localtime-40*86400) == easter_month labels: name: Feast of the Ascension Easter @roidelapluie
  • 52. - record: business_day expr: | vector(1) and day_of_week(belgium_localtime) > 0 and day_of_week(belgium_localtime) < 6 unless count(public_holiday) - record: belgium_hour expr: | hour(belgium_localtime) - record: business_hour expr: | vector(1) and belgium_hour >= 8 < 18 and business_day Business hour @roidelapluie
  • 53. - record: extended_business_hour expr: | (vector(1) and belgium_hour >= 7 < 20 and business_day) - record: extended_business_hour_sat expr: | extended_business_hour or (vector(1) and belgium_hour >= 7 < 14 and day_of_week(belgium_localtime) == 6 unless count(public_holiday)) Extended business hours @roidelapluie
  • 54. (sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 10 and on () business_hour) or (sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 1 and sum(rate(http_requests_total{code=~"2.."}[5m])) by (vhost) < 1) and on () business_hour Thresholds depending on time @roidelapluie
  • 55. - record: daylight expr: | vector(1) and belgium_hour >= 8 < 18 - record: extended_daylight expr: | vector(1) and belgium_hour >= 7 < 20 Day and night @roidelapluie
  • 56. - alert: Time Window - Night expr: absent(daylight) labels: recipient: none - alert: Time Window - OBH expr: absent(business_hour) labels: recipient: none - alert: Time Window - Extended Night expr: absent(extended_daylight) labels: recipient: none Alerts @roidelapluie
  • 57. - alert: Time Window - Extended OBH with Saturday expr: absent(extended_business_hour_sat) labels: recipient: none - alert: Time Window - Extended OBH expr: absent(extended_business_hour) labels: recipient: none Alerts @roidelapluie
  • 58. At this point, we will have "meaningless" alerts at night and during business holidays. Alerts @roidelapluie
  • 59. Inhibition is a concept of suppressing noti cations for certain alerts if certain other alerts are already ring. Inhibition @roidelapluie Alertmanager documentation CC-BY-4.0
  • 60. Alertmanager inhibition inhibit_rules: - source_match: alertname: "Time Window - Night" target_match: time_window: 10x7 - source_match: alertname: "Time Window - OBH" target_match: time_window: 10x5 Inhibition @roidelapluie
  • 61. Alertmanager inhibition - source_match: alertname: "Time Window - Extended Night" target_match: time_window: 13x7 - source_match: alertname: "Time Window - Extended OBH" target_match: time_window: 13x5 - source_match: alertname: "Time Window - Extended OBH with Saturday" target_match: time_window: 13x6 Inhibition @roidelapluie
  • 63. Prometheus alert - alert: a target is down expr: up == 0 for: 5m labels: recipients_prod: customer1/sms,opsteam/ticket time_window_prod: 24x7 recipients: opsteam/chat time_window: 8x5 Per env recipients @roidelapluie
  • 64. Prometheus con g alerting: alert_relabel_configs: - source_labels: [time_window_prod,env] regex: "(.+);prod" target_label: time_window replacement: '$1' - source_labels: [time_window_dev,env] regex: "(.+);dev" target_label: time_window replacement: '$1' Repeat for other env, other labels. Per env recipients @roidelapluie
  • 65. Prometheus con g alerting: alert_relabel_configs: - source_labels: [time_window] regex: "never" action: drop Be careful about the order (time_window can be mutated by relabeling). Drop alert @roidelapluie
  • 67. • Alert-Writing experience is great with this • Prometheus and PromQL can do a lot • Con g management lls the "gaps" • With some e ort, we have everything we wanted Conclusion @roidelapluie
  • 68. - name: normal rate interval: 120s rules: - record: request_rate_history expr: | sum(rate(http_requests_total[5m])) by (env) labels: when: 0w Bonus @roidelapluie
  • 69. - record: request_rate_history expr: | ( sum(rate(http_requests_total[5m] offset 168h)) by (env) and on () ( daily_saving_time_belgium offset 1w == daily_saving_time_belgium) ) or ( sum(rate(http_requests_total[5m] offset 167h)) by (env) and on () ( daily_saving_time_belgium offset 1w < daily_saving_time_belgium) ) or ( sum(rate(http_requests_total[5m] offset 169h)) by (env) and on () ( daily_saving_time_belgium offset 1w > daily_saving_time_belgium) ) labels: when: 1w Bonus @roidelapluie
  • 70. - record: request_rate_normal expr: | max(bottomk(1, topk(4, request_rate_history) by(env) ) by(env)) by(env) Bonus @roidelapluie
  • 73. Really hope we will get rid of DST in 2021! Bonus @roidelapluie
  • 74. Julien Pivotto @roidelapluie roidelapluie@inuits.eu Essensteenweg 31 2930 Brasschaat Belgium Contact: info@inuits.eu +32-3-8082105