CONTINUOUS DELIVERY. CONTINUOUS DEVOPS.
Professional conference on DevOps practices
6th April 2019, Kyiv
Björn “Beorn” Rabenstein
Applied Alerting Philosophy
My Philosophy on Alerting
Based on my observations while I was a Site Reliability Engineer at Google
Author: Rob Ewaschuk <rob@infinitepigeons.org>
Introduction
Vernacular
Monitor for your users
Cause-based alerts are bad (but sometimes necessary)
Alerting from the spout (or beyond!)
Causes are still useful
Tickets, Reports and Email
Playbooks
Tracking & Accountability
You're being naïve!
Summary
Summary
When you are auditing or writing alerting rules, consider these things to keep your oncall rotation happier:
goo.gl/2vrpSO
Symptoms vs. causes
“‘What’ versus ‘why’ is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.”
Source: Betsy Beyer et al., “Site Reliability Engineering – How Google Runs Production Systems”, Chapter 6: Monitoring Distributed Systems
Pages vs. tickets

Expected response | SRE book | SoundCloud lingo | Delivered to | Based on
Act immediately | Alerts | Pages (severity="critical") | Pager | Symptoms
Act eventually | Tickets | Tickets / “email alerts” (severity="warning") | Issue tracker / chat / email :-( | Symptoms or causes
None (for diagnostics only) | Logs | Informational alerts (severity="info") | Nowhere / dashboards | Causes

“Alerts” according to Prometheus: all three rows of the SoundCloud lingo column (pages, tickets, and informational alerts).
What also counts as “symptoms”…
“One person’s symptom is another person’s cause.”
“Not-yet-occurring but imminent problems.”
“Zero-redundancy (N + 0) situations count as imminent, as do ‘nearly full’ parts of your service!”
Black-box vs. white-box
[Diagram with two axes: white-box (needs instrumentation) vs. black-box (no changes required), and host-based (“traditional”) vs. service-based (“modern”). Callouts: “Black-box monitoring FTW?!?” and “Prometheus Blackbox Exporter”.]
Probing with real user traffic in multi-tiered services
[Diagram: user traffic reaches an instrumented frontend service, which calls backend service A and backend service B. The frontend measures A’s and B’s latency, rps, errors… and alerts the owners of A or B.]
Alerting on SLOs at
https://developers.soundcloud.com/blog/how-soundcloud-uses-haproxy-with-kubernetes-for-user-facing-traffic
AMPELMANN GmbH CC BY-SA 4.0
- record: backend:http_errors_per_response:ratio_rate5m
  expr: |2
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann", code="5xx"}[5m]
      ))
    /
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann"}[5m]
      ))
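The alerting rule later in the deck also reads a backend:http_errors_per_response:ratio_rate1h series, which the slides do not show. A minimal sketch of what that rule could look like, assuming it follows the same pattern as the 5m rule with only the range changed (the rule body is an assumption, not taken from the talk):

```yaml
# Hypothetical 1h companion to the 5m recording rule; assumed, not shown in the slides.
- record: backend:http_errors_per_response:ratio_rate1h
  expr: |2
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann", code="5xx"}[1h]
      ))
    /
      sum by (backend)(rate(
        haproxy_backend_http_responses_total{job="ampelmann"}[1h]
      ))
```

Each additional window used for alerting (for example 30m or 6h) would presumably need its own rule of the same shape.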
- record: backend:error_slo:percent
  labels:
    backend: "api-v4"
  expr: 0.1
- record: backend:error_slo:percent
  labels:
    backend: "api-v3"
  expr: 0.2
# ... Many more backends.
SLO budget consumption | Time window | Burn rate | Notification
2% | 1 hour | 14.4 | Page
5% | 6 hours | 6 | Page
10% | 3 days | 1 | Ticket
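The burn rates in this table can be reproduced with simple arithmetic. The slide does not state the SLO period; the worked example below assumes 30 days (720 hours), which is the period under which the numbers come out as listed.

```latex
\[
\text{burn rate} \;=\; \frac{\text{budget consumed} \times \text{SLO period}}{\text{time window}}
\]
\[
\frac{0.02 \times 720\,\text{h}}{1\,\text{h}} = 14.4, \qquad
\frac{0.05 \times 720\,\text{h}}{6\,\text{h}} = 6, \qquad
\frac{0.10 \times 30\,\text{d}}{3\,\text{d}} = 1
\]
```

A burn rate of 1 uses up the error budget exactly at the end of the SLO period; a burn rate of 14.4 would exhaust it in 720 h / 14.4 = 50 hours.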
expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      or
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      )
severity: page

expr: job:slo_errors_per_request:ratio_rate3d{job="myjob"} > 0.001
severity: ticket
- alert: AmpelmannErrorBudgetBurn
  expr: |2
      (
        100 * backend:http_errors_per_response:ratio_rate1h
        > on (backend)
        14.4 * backend:error_slo:percent
      )
    and
      (
        100 * backend:http_errors_per_response:ratio_rate5m
        > on (backend)
        14.4 * backend:error_slo:percent
      )
  for: 2m
  labels:
    system: "{{$labels.backend}}"
    severity: "critical"
    window: "1h"
  annotations:
    summary: "a backend burns its error budget very fast"
    description: "Backend {{$labels.backend}} has returned {{ $value | printf `%.2f` }}% 5xx
    runbook: "http://runbooks.soundcloud.com/runbooks/ampelmann/#ampelmannerrorbudgetburn"
Alert | Long window | Short window | for duration | Burn rate factor | Error budget consumed
Page | 1h | 5m | 2m | 14.4 | 2%
Page | 6h | 30m | 15m | 6 | 5%
Ticket | 1d | 2h | 1h | 3 | 10%
Ticket | 3d | 6h | 1h | 1 | 10%
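The deck only spells out the rule for the first row of this table. As a rough sketch, the second row (6h long window, 30m short window, 15m for duration, factor 6) might translate into a rule like the one below. It assumes that recording rules named backend:http_errors_per_response:ratio_rate6h and ...:ratio_rate30m exist and that the same alert name and labels are reused; none of this is shown in the slides.

```yaml
# Hypothetical rule for the "Page | 6h | 30m | 15m | 6" row; names and labels are assumptions.
- alert: AmpelmannErrorBudgetBurn
  expr: |2
      (
        100 * backend:http_errors_per_response:ratio_rate6h
        > on (backend)
        6 * backend:error_slo:percent
      )
    and
      (
        100 * backend:http_errors_per_response:ratio_rate30m
        > on (backend)
        6 * backend:error_slo:percent
      )
  for: 15m
  labels:
    system: "{{$labels.backend}}"
    severity: "critical"   # this row is a page
    window: "6h"
  annotations:
    summary: "a backend burns its error budget fast"
```

The two ticket rows would follow the same shape with severity "warning" and their own windows, factors, and for durations.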
Who gets the page?
route:
  receiver: prodeng-warn
  group_by:
  - alertname
  - zone
  - system
  routes:
  - receiver: api-team-warn
    match:
      system: api-v4
    routes:
    - receiver: api-team-crit
      match:
        severity: critical
      group_wait: 20s
      group_interval: 5m
      repeat_interval: 3h
    - receiver: api-team-info
      match:
        severity: info
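The routing tree refers to receivers by name only. The sketch below shows one way the matching receivers section of an Alertmanager configuration could look, following the "Delivered to" column of the pages vs. tickets table: critical alerts go to a pager, warnings to email, informational alerts nowhere. The integrations, addresses, and keys are placeholders and assumptions; the talk does not show SoundCloud's actual receivers.

```yaml
# Hypothetical receivers for the route tree; all targets are placeholders.
# email_configs additionally relies on SMTP settings in the global section (not shown).
receivers:
- name: prodeng-warn
  email_configs:
  - to: "prodeng@example.com"
- name: api-team-crit
  pagerduty_configs:
  - routing_key: "REPLACE_WITH_PAGERDUTY_KEY"
- name: api-team-warn
  email_configs:
  - to: "api-team@example.com"
# A receiver without integrations: alerts routed here are not delivered anywhere.
- name: api-team-info
```

With this layout, an alert labelled system="api-v4" and severity="critical" would match the api-team-warn route first and then its api-team-crit child, so it pages the API team rather than falling through to the prodeng-warn default.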
https://prometheus.io
https://github.com/beorn7/talks
https://developers.soundcloud.com/blog

DevOps Fest 2019. Björn Rabenstein. Applied Alerting Philosophy
