Migrating to Prometheus
What we learned running it in production
Marco Pracucci @pracucci 22 November, 2017
I’m Marco
- Software engineer moved to the DevOps side
- Co-founder and former CTO at Spreaker
Podcasting platform
Create, host, distribute and monetize your podcast
Why monitoring?
- Failure detection and alerting
- Get insights when things go wrong
- Analyze trends over time
Our setup
The Spreaker infrastructure spans 3 AWS regions.
We run Prometheus in each region, to keep it close to the monitored targets.
What we monitor
Infrastructure
- Nodes
- Kubernetes cluster health
- VPN
- Logging
- Backups
- ...
Applications
- Our applications (running on both VMs and containers)
- Third-party applications (PostgreSQL, RabbitMQ, Redis, ...)
External providers
Key business metrics
Why Prometheus?
- Simple yet powerful
- Backed by a time-series database
- A query language we love
What is this talk about?
What we learned using it
(hopefully some of it will be useful to you too)
Examples are based on Prometheus v. 1
(but the same concepts apply to Prometheus v. 2)
#1 - Identify your golden signals
When you start monitoring a new system, it’s important to
understand what the key metrics are.
Ask yourself basic questions to identify them:
- Is the system up and responsive?
- How much more traffic / how many more queries can it sustain?
The RED method by Tom Wilkie:
- Request rate: requests / sec (traffic)
- Error rate: errors % (failures)
- Duration: response time (performance)
Then add saturation monitoring:
e.g. CPU, memory, I/O, ...
#2 - Monitor your golden signals
A quick recap of the architecture
Prometheus:
- Discover targets
- Scrape metrics
- Store metrics
- Query language
- Evaluate alerting rules
[Diagram: Prometheus pulls metrics from App #1 to App #4 running on the nodes, and pushes alerts to Alertmanager]
Request rate = requests / sec
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="viewUser",status="200"} 80
- Counter incremented on each request received
- Labeled by method, handler and response status code
Request rate = requests / sec
sum(rate(http_requests_total[1m]))
{} 7.02
Group by method:
sum(rate(http_requests_total[1m])) by (method)
{method="GET"} 6.10
{method="POST"} 0.92
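To make the numbers above less opaque, here is a minimal Python sketch of what a per-second rate over a counter boils down to. It is a simplified model, not Prometheus's implementation: it handles counter resets but does not extrapolate to the window boundaries as the real rate() does. The function name `simple_rate` and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter over the given samples.

    Simplified model: handles counter resets, but does not extrapolate
    to the window boundaries as Prometheus does.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A counter that goes down has been reset (e.g. process restart):
        # count the new value as growth from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples, scraped every 15s, for two http_requests_total series.
get_series = [(0, 80), (15, 170), (30, 262), (45, 355), (60, 446)]
post_series = [(0, 10), (15, 24), (30, 5), (45, 19), (60, 33)]  # reset at t=30

rates = {"GET": simple_rate(get_series), "POST": simple_rate(post_series)}
total = sum(rates.values())  # analogous to sum(rate(...)) with no grouping
```

Keeping the per-method values in a dict mirrors the by (method) grouping, while summing them mirrors the ungrouped query.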
Error rate = total number of errors / total requests
sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) /
sum(rate(http_requests_total[1m]))
{} 0.02 = 2%
Error rate by method and handler
sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) by (method, handler) /
sum(rate(http_requests_total[1m])) by (method, handler)
{method="GET",handler="viewUser"} 0.015
{method="POST",handler="editUser"} 0.005
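The grouped error-rate query can be read as "sum the error series and all series per (method, handler), then divide group by group". The following Python sketch shows that logic under stated assumptions: the per-series rates are made-up numbers, and `error_ratio_by_method_handler` is a hypothetical helper, not a Prometheus API.

```python
import re
from collections import defaultdict

# Hypothetical per-series request rates (requests/sec), keyed by
# (method, handler, status), as rate(http_requests_total[1m]) might return.
series_rates = {
    ("GET", "viewUser", "200"): 5.91,
    ("GET", "viewUser", "404"): 0.06,
    ("GET", "viewUser", "500"): 0.03,
    ("POST", "editUser", "200"): 0.995,
    ("POST", "editUser", "503"): 0.005,
}

# PromQL regex matchers are fully anchored, hence fullmatch below.
ERROR_STATUS = re.compile(r"(4|5).*")


def error_ratio_by_method_handler(rates):
    """Errors / total requests for each (method, handler) group."""
    errors = defaultdict(float)
    totals = defaultdict(float)
    for (method, handler, status), value in rates.items():
        key = (method, handler)
        totals[key] += value
        if ERROR_STATUS.fullmatch(status):
            errors[key] += value
    # Only groups with at least one error series produce a result,
    # as with the PromQL division.
    return {key: errors[key] / totals[key] for key in errors}


ratios = error_ratio_by_method_handler(series_rates)
```

Note that a group with no 4xx/5xx series at all yields no result rather than 0, which matters when alerting on the ratio.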
Average response time = sum of response times / number of requests
# TYPE http_requests_duration_seconds counter
http_requests_duration_seconds{method="GET",handler="viewUser",status="200"} 4.5
- Sum of all the response times
- Labeled by method, handler and response status code
sum(increase(http_requests_duration_seconds[1m])) /
sum(increase(http_requests_total[1m]))
{} 0.075 = 75 ms
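The arithmetic behind the average can be sketched in a few lines of Python. The start and end counter values are hypothetical, chosen to reproduce the 75 ms figure, and `average_response_time` is an illustrative helper rather than anything Prometheus provides.

```python
def average_response_time(duration_increase, requests_increase):
    """Mean seconds per request over a window.

    Mirrors sum(increase(duration)) / sum(increase(requests)); returns
    None when no request was served, where PromQL would give NaN or no
    result at all.
    """
    if requests_increase == 0:
        return None
    return duration_increase / requests_increase


# Hypothetical counter values at the start and end of a 1m window.
duration_start, duration_end = 4.5, 9.0  # http_requests_duration_seconds
requests_start, requests_end = 80, 140   # http_requests_total

average = average_response_time(duration_end - duration_start,
                                requests_end - requests_start)
average_ms = average * 1000
```

Because both counters only ever grow, dividing their increases over the same window gives the mean for that window only, not for the whole process lifetime.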
#3 - Alert on golden signals
A quick recap of the architecture
Alertmanager:
- Receives alerts as input
- Routes alerts to receivers
We use email, Slack and OpsGenie, but it supports many more.
[Diagram: Prometheus pulls metrics from App #1 to App #4 running on the nodes, and pushes alerts to Alertmanager]
Alert on high error rate:
- Use a % threshold
- Prefer without() over by(), so the alert keeps every other label and stays easy to investigate
ALERT HIGH_ERROR_RATE
IF sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) without (status) /
sum(rate(http_requests_total[1m])) without (status)
> 0.01
FOR 5m
Prometheus v. 1 syntax
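The difference between by() and without() is easiest to see on label sets. This Python sketch assumes a simplified model of PromQL aggregation (`aggregate_sum` is a hypothetical helper, and the `instance` label and values are made up): by() keeps only the listed labels, without() drops only the listed ones.

```python
from collections import defaultdict


def aggregate_sum(series, by=None, without=None):
    """Sum series values, grouping like PromQL's by() or without()."""
    if (by is None) == (without is None):
        raise ValueError("pass exactly one of by= or without=")
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            kept = {k: v for k, v in labels.items() if k not in without}
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


# Hypothetical series: the "instance" label is an assumption for illustration.
series = [
    ({"method": "GET", "handler": "viewUser", "status": "200", "instance": "app-1"}, 3.0),
    ({"method": "GET", "handler": "viewUser", "status": "500", "instance": "app-1"}, 0.5),
    ({"method": "GET", "handler": "viewUser", "status": "200", "instance": "app-2"}, 2.0),
]

by_method = aggregate_sum(series, by={"method"})
without_status = aggregate_sum(series, without={"status"})
```

With by (method) the result collapses to a single group, so the alert cannot say which handler or instance is failing; with without (status) each handler and instance keeps its own result and its own alert.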
Alert on high response times:
- Use an absolute value
- Prefer without() over by(), so the alert keeps every other label and stays easy to investigate
ALERT HIGH_RESPONSE_TIMES
IF sum(increase(http_requests_duration_seconds[1m])) without (status) /
sum(increase(http_requests_total[1m])) without (status)
> 0.5
FOR 5m
Prometheus v. 1 syntax
#4 - Dead targets
A quick recap of the architecture
[Diagram: a custom exporter on the node runs SQL queries against PostgreSQL; Prometheus pulls metrics from the exporter and pushes alerts to Alertmanager]
The exporter exports:
# TYPE postgres_up gauge
postgres_up{} 1
And we alert on it:
ALERT POSTGRESQL_IS_DOWN
IF postgres_up == 0
FOR 5m
What if Prometheus can’t scrape metrics from a target?
If the exporter is down, Prometheus cannot scrape postgres_up{} 0,
so our previous alert will never fire.
We can improve the alert with absent():
ALERT POSTGRESQL_IS_DOWN
IF postgres_up == 0 or absent(postgres_up)
FOR 5m
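A small Python sketch can show why the absent() clause matters and how FOR 5m delays firing. It is a simplified model under stated assumptions: each evaluation sees a dict of the metrics currently known, a missing key stands for a series that no longer exists, and the helper names and the 60s evaluation interval are hypothetical.

```python
def postgres_is_down(samples):
    """Evaluate `postgres_up == 0 or absent(postgres_up)` for one evaluation.

    `samples` maps metric name -> value; a missing key models a series
    that Prometheus no longer has (e.g. the exporter cannot be scraped).
    """
    if "postgres_up" not in samples:
        return True  # absent(postgres_up) returns a result
    return samples["postgres_up"] == 0


def firing_times(evaluations, for_seconds):
    """Timestamps at which the alert fires, honouring the FOR duration."""
    active_since = None
    firing = []
    for timestamp, samples in evaluations:
        if postgres_is_down(samples):
            if active_since is None:
                active_since = timestamp  # alert becomes pending
            if timestamp - active_since >= for_seconds:
                firing.append(timestamp)
        else:
            active_since = None  # condition cleared, reset the timer
    return firing


# The exporter is healthy at t=0, then disappears from t=60 onwards.
evaluations = [(0, {"postgres_up": 1.0})] + [(t, {}) for t in range(60, 481, 60)]
fired_at = firing_times(evaluations, for_seconds=300)
```

Without the membership check in `postgres_is_down`, the missing series would never compare equal to 0 and the alert would stay silent for the whole outage.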
#5 - Route alerts by team and severity
Use labels to define the alert’s team and severity:
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF ... > 0.01
LABELS {
  team="frontend",
  severity="warning"
}
Prometheus v. 1 syntax
We support three levels of severity:
- warning: Slack (handled the next business day)
- error: Slack + Email (handled during daytime, weekends included)
- critical: Slack + Email + SMS / Phone call (handled immediately)
Use child routes to route by team first, then severity:
route:
  routes:
    # Team specific alerts
    - match:                      # Route by team first
        team: frontend
      routes:                     # If team matches, enter the child routes
        - match_re:               # Send critical via opsgenie
            severity: critical
          receiver: page-frontend-team-by-opsgenie
          continue: true
        - match_re:               # Send critical and error via email
            severity: critical|error
          receiver: page-frontend-team-by-email
          continue: true
        - receiver: page-frontend-team-by-slack   # Always send via slack
          continue: false         # If team did match, stop here
If no team matches, fall back to the ops team:
route:
  routes:
    # Team specific alerts
    # ...
    # Fallback to ops team
    - match_re:                   # Send critical via opsgenie
        severity: critical
      receiver: page-ops-team-by-opsgenie
      continue: true
    - match_re:                   # Send critical and error via email
        severity: error|critical
      receiver: page-ops-team-by-email
      continue: true
    - receiver: page-ops-team-by-slack   # Always send via slack
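To check how such a routing tree behaves, it can help to model it outside Alertmanager. The Python sketch below is a simplified model of the matching rules as I understand them (sibling routes are tried in order, a matching route stops the search unless it sets continue, regexes are anchored); it is not Alertmanager's code, and `receivers_for` and `route_matches` are hypothetical helpers.

```python
import re


def route_matches(route, labels):
    """True if the alert labels satisfy the route's match and match_re."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Patterns are treated as anchored, hence fullmatch.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def receivers_for(route, labels, inherited=None):
    """Receivers notified for an alert, walking the tree depth-first."""
    if not route_matches(route, labels):
        return []
    receiver = route.get("receiver", inherited)
    found = []
    for child in route.get("routes", []):
        matched = receivers_for(child, labels, receiver)
        found.extend(matched)
        if matched and not child.get("continue", False):
            break  # a matching child without continue stops the search
    if not found:
        found.append(receiver)  # no child matched: this node handles it
    return found


def severity_routes(team):
    """The three severity routes from the slides, for a given team."""
    return [
        {"match_re": {"severity": "critical"},
         "receiver": f"page-{team}-team-by-opsgenie", "continue": True},
        {"match_re": {"severity": "critical|error"},
         "receiver": f"page-{team}-team-by-email", "continue": True},
        {"receiver": f"page-{team}-team-by-slack"},
    ]


root = {"routes": [
    {"match": {"team": "frontend"}, "routes": severity_routes("frontend")},
    *severity_routes("ops"),  # fallback when no team route matched
]}

frontend_critical = receivers_for(root, {"team": "frontend", "severity": "critical"})
frontend_warning = receivers_for(root, {"team": "frontend", "severity": "warning"})
unowned_error = receivers_for(root, {"team": "backend", "severity": "error"})
```

In this model the frontend route has no continue flag, so once it matches, the ops fallback routes are never evaluated for frontend alerts.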
#6 - Associate playbooks to alerts
Document manual operations in an easy-to-read playbook,
and link it to the alert using ANNOTATIONS:
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF ... > 0.01
LABELS { team="frontend", severity="warning" }
ANNOTATIONS {
  playbook="https://doc.spreaker.com/playbooks/high-error-rate-on-frontend"
}
Prometheus v. 1 syntax
Customize the alert messages so that they display the playbook too.
#7 - Labels and Annotations
Both labels and annotations allow you to attach metadata to your alerts.
ALERT HIGH_ERROR_RATE_ON_FRONTEND
IF ... > 0.01
LABELS {
  team="frontend",
  severity="warning"
}
ANNOTATIONS {
  playbook="https://..."
}
Prometheus v. 1 syntax
LABELS
- Information to identify an alert
- Read by a machine
ANNOTATIONS
- Extra information for the receiver (e.g. description)
- Read by a human
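One way to picture the split is that an alert's identity comes from its labels only, while annotations are carried along as payload. The Python sketch below illustrates that idea under stated assumptions: `alert_identity` is a hypothetical stand-in for the label-set fingerprint, and the description text is made up.

```python
def alert_identity(labels):
    """Identity of an alert, derived from its labels only.

    A frozenset stands in for the label-set fingerprint; annotations
    are deliberately not part of it.
    """
    return frozenset(labels.items())


labels = {
    "alertname": "HIGH_ERROR_RATE_ON_FRONTEND",
    "team": "frontend",
    "severity": "warning",
}

first = {"labels": labels, "annotations": {"description": "error rate is 2%"}}
second = {"labels": dict(labels), "annotations": {"description": "error rate is 3%"}}
escalated = {"labels": {**labels, "severity": "critical"}, "annotations": {}}

same_alert = alert_identity(first["labels"]) == alert_identity(second["labels"])
different_alert = alert_identity(first["labels"]) != alert_identity(escalated["labels"])
```

This is why values that change on every evaluation, such as the current error rate, belong in annotations: putting them in labels would turn each new value into a separate alert.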
To recap
1. Keep it simple
2. Focus on metrics that bring you value
3. Ensure each alert is actionable
4. Write playbooks for manual intervention
5. Do not alert at all if you can automate the resolution
Thanks
Questions?
Marco Pracucci
If you liked it, follow me on Twitter:
@pracucci

  • 2. I’m Marco - Software engineer moved to the DevOps side - Co-founder and former CTO at Spreaker
  • 3. Podcasting platform: create, host, distribute and monetize your podcast
  • 4. Why monitoring? - Failure detection and alerting - Get insights when things go wrong - Analyze trends over time
  • 5. Our setup The Spreaker infrastructure spans over 3 AWS regions
  • 6. Our setup We run Prometheus in each region, to keep it close to the monitored targets
  • 7. Infrastructure - Nodes - Kubernetes cluster health - VPN - Logging - Backups - ... What we monitor
  • 8. Infrastructure Applications - Our applications (both running on VM and containers) - Third-party applications (PostgreSQL, RabbitMQ, Redis, ...) What we monitor
  • 10. Simple yet powerful Backed by a time-series database A query language we love Why Prometheus?
  • 12. What we learned using it (hope something will make sense to you too)
  • 13. Examples are on Prometheus v. 1 (but same concepts apply to Prometheus v. 2)
  • 14. When you start monitoring a new system it’s important to understand what the key metrics are. Ask yourself basic questions to identify key metrics: - Is the system up and responsive? - How much more traffic / queries can it sustain? #1 - Identify your golden signals
  • 15. The RED method by Tom Wilkie: #1 - Identify your golden signals Request rate Error rate Duration requests / sec errors % response time traffic failures performance
  • 16. The RED method by Tom Wilkie: #1 - Identify your golden signals Request rate Error rate Duration requests / sec errors % response time traffic failures performance Then add saturation monitoring: e.g. CPU, memory, I/O, ...
  • 17. - Discover targets - Scrape metrics - Store metrics - Query language - Evaluate alerting rules A quick recap about the architecture #2 - Monitor your golden signals App #1 App #3 App #2 App #4 prometheus alert manager pull metrics push alerts node node
  • 18. Request rate = requests / sec #2 - Monitor your golden signals #TYPE http_requests_total counter http_requests_total{method="GET",handler="viewUser",status="200"} 80 - Counter incremented on each request received - By method, handler and response status code
  • 19. Request rate = requests / sec #2 - Monitor your golden signals sum(rate(http_requests_total[1m])) {} 7.02 Group by method sum(rate(http_requests_total[1m])) by (method) {method="GET"} 6.10 {method="POST"} 0.92
  • 20. Error rate = total number of errors / total requests #2 - Monitor your golden signals sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) / sum(rate(http_requests_total[1m])) {} 0.02 = 2%
  • 21. Error rate by method and handler #2 - Monitor your golden signals sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) by (method, handler) / sum(rate(http_requests_total[1m])) by (method, handler) {method="GET",handler="viewUser"} 0.015 {method="POST",handler="editUser"} 0.005
  • 22. #2 - Monitor your golden signals Average response time = sum of response times / number of requests #TYPE http_requests_duration_seconds counter http_requests_duration_seconds{method="GET",handler="viewUser",status="200"} 4.5 - Sum of all the response times - By method, handler and response status code
  • 23. #2 - Monitor your golden signals Average response time = sum of response times / number of requests sum(increase(http_requests_duration_seconds[1m])) / sum(increase(http_requests_total[1m])) {} 0.075 = 75 ms
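The three golden-signal queries above can be reproduced with plain arithmetic. The following is a minimal sketch, not how Prometheus itself is implemented: `simple_rate` and the sample values are hypothetical, and the real `rate()` / `increase()` functions also handle counter resets and extrapolate to the window boundaries, which this model ignores.

```python
# Simplified model of rate-based golden signals computed from counters.
# Assumption: each series has exactly two samples, one minute apart,
# and no counter reset happened in between.

def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical http_requests_total samples, keyed by response status.
requests_total = {
    "200": [(0, 1000), (60, 1588)],
    "500": [(0, 10), (60, 22)],
}
# Hypothetical http_requests_duration_seconds samples (sum of response times).
duration_seconds = {
    "200": [(0, 70.0), (60, 112.0)],
    "500": [(0, 2.0), (60, 5.0)],
}

# sum(rate(http_requests_total[1m]))
request_rate = sum(simple_rate(s) for s in requests_total.values())

# sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) / sum(rate(http_requests_total[1m]))
error_rate = sum(
    simple_rate(s)
    for status, s in requests_total.items()
    if status.startswith(("4", "5"))
)
error_ratio = error_rate / request_rate

# Dividing the two per-second rates gives the same result as dividing the two
# increases over the same window, because the window length cancels out.
avg_response_time = sum(simple_rate(s) for s in duration_seconds.values()) / request_rate

print(f"request rate: {request_rate:.2f} req/s")
print(f"error ratio: {error_ratio:.2%}")
print(f"average response time: {avg_response_time * 1000:.0f} ms")
```

The sample values were chosen so that the error ratio and average response time line up with the 2% and 75 ms figures shown on the slides.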
  • 24. - Get alerts in input - Route alerts to receivers We use email, slack, opsgenie ... but it supports many more #3 - Alert on golden signals A quick recap about the architecture App #1 App #3 App #2 App #4 prometheus alert manager pull metrics push alerts node node
  • 25. Alert on high error rate: - Use % threshold - Prefer without() over by() to keep high observability #3 - Alert on golden signals ALERT HIGH_ERROR_RATE ON sum(rate(http_requests_total{status=~"(4|5).*"}[1m])) without (status) / sum(rate(http_requests_total[1m])) without(status) > 0.01 FOR 5m Prometheus v. 1 syntax
  • 26. Alert on high response times: - Use absolute value - Prefer without() over by() to keep high observability ALERT HIGH_RESPONSE_TIMES ON sum(increase(http_requests_duration_seconds[1m])) without (status) / sum(increase(http_requests_total[1m])) without(status) > 0.5 FOR 5m #3 - Alert on golden signals Prometheus v. 1 syntax
  • 27. #4 - Dead targets A quick recap about the architecture PostgreSQL Custom exporter prometheus alert manager pull metrics push alerts node SQL queries
  • 28. #4 - Dead targets #TYPE postgres_up gauge postgres_up{} 1 The exporter exports ALERT POSTGRESQL_IS_DOWN ON postgres_up == 0 FOR 5m And we alert on it
  • 29. #4 - Dead targets What if Prometheus can’t scrape metrics from a target? PostgreSQL Custom exporter prometheus alert manager pull metrics push alerts node SQL queries
  • 30. #4 - Dead targets Prometheus will not scrape postgres_up{} 0 because the exporter is down, and our previous alert will never fire ALERT POSTGRESQL_IS_DOWN ON postgres_up == 0 or absent(postgres_up) FOR 5m We can improve the alert with absent()
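The difference between the two dead-target alert conditions can be modeled in a few lines. This is only an illustrative sketch: the function names are hypothetical, a series is represented as a plain list of current sample values, and the `FOR 5m` hold period is not modeled.

```python
# Model of the two alert expressions for a dead PostgreSQL target.
# Assumption: `series` holds the current postgres_up sample values;
# it is empty when Prometheus could not scrape the exporter at all.

def original_alert(series):
    """postgres_up == 0"""
    return any(value == 0 for value in series)

def improved_alert(series):
    """postgres_up == 0 or absent(postgres_up)"""
    return original_alert(series) or len(series) == 0

exporter_ok = [1]        # exporter reachable, PostgreSQL up
postgres_down = [0]      # exporter reachable, PostgreSQL down
exporter_dead = []       # exporter unreachable, no sample at all

for name, series in [
    ("exporter ok", exporter_ok),
    ("postgres down", postgres_down),
    ("exporter dead", exporter_dead),
]:
    print(name, original_alert(series), improved_alert(series))
```

Only the improved condition fires when the exporter itself is dead, since in that case there is no `postgres_up` sample to compare against 0.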
  • 31. ALERT HIGH_ERROR_RATE_ON_FRONTEND ON ... > 0.01 LABELS { team="frontend", severity="warning" } Use labels to define the alert’s team and severity #5 - Route alerts by team and severity Prometheus v. 1 syntax
  • 32. We support three levels of severity: warning error critical Slack Slack + Email Slack + Email + SMS / Phone call #5 - Route alerts by team and severity next business day daylight (weekend included) immediately
  • 33. route: routes: # Team specific alerts - match: team: frontend routes: - match_re: severity: critical receiver: page-frontend-team-by-opsgenie continue: true - match_re: severity: critical|error receiver: page-frontend-team-by-email continue: true - receiver: page-frontend-team-by-slack continue: false Use child routes to route by team first, then severity: #5 - Route alerts by team and severity Route by team first If team matches, enter the child routes Send critical via opsgenie Send critical and error via email Always send via slack If team did match, stop here
  • 34. route: routes: # Team specific alerts # ... # Fallback to ops team - match_re: severity: critical receiver: page-ops-team-by-opsgenie continue: true - match_re: severity: error|critical receiver: page-ops-team-by-email continue: true - receiver: page-ops-team-by-slack If no team matches, fallback to ops team: #5 - Route alerts by team and severity Send critical via opsgenie Send critical and error via email Always send via slack
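To see how `continue` and child routes interact in the two routing slides above, the tree walk can be simulated. This is a simplified sketch of the matching behaviour as described in the slides, not Alertmanager's code: the `TREE` dictionary, `matches` and `route` are hypothetical names, grouping and inhibition are ignored, and regular expressions are treated as fully anchored.

```python
import re

# Routing tree mirroring the configuration from the slides.
TREE = {"routes": [
    {"match": {"team": "frontend"}, "routes": [
        {"match_re": {"severity": "critical"},
         "receiver": "page-frontend-team-by-opsgenie", "continue": True},
        {"match_re": {"severity": "critical|error"},
         "receiver": "page-frontend-team-by-email", "continue": True},
        {"receiver": "page-frontend-team-by-slack", "continue": False},
    ]},
    # Fallback to the ops team.
    {"match_re": {"severity": "critical"},
     "receiver": "page-ops-team-by-opsgenie", "continue": True},
    {"match_re": {"severity": "error|critical"},
     "receiver": "page-ops-team-by-email", "continue": True},
    {"receiver": "page-ops-team-by-slack"},
]}

def matches(alert, node):
    """True when every exact and regex matcher of the node accepts the alert labels."""
    exact = all(alert.get(k) == v for k, v in node.get("match", {}).items())
    regex = all(
        re.fullmatch(pattern, alert.get(k, "")) is not None
        for k, pattern in node.get("match_re", {}).items()
    )
    return exact and regex

def route(alert, node):
    """Collect receivers depth-first; stop at the first matching sibling without continue."""
    receivers = []
    for child in node.get("routes", []):
        if not matches(alert, child):
            continue
        receivers.extend(route(alert, child))
        if not child.get("continue", False):
            break
    # A node handles the alert itself only when none of its children did.
    if not receivers and "receiver" in node:
        receivers.append(node["receiver"])
    return receivers

print(route({"team": "frontend", "severity": "error"}, TREE))
print(route({"team": "backend", "severity": "critical"}, TREE))
print(route({"severity": "warning"}, TREE))
```

Because the frontend team route does not set `continue`, an alert labeled `team="frontend"` never reaches the ops fallback routes, while alerts from any other team fall through to them.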
  • 35. Document manual operations in an easy to read playbook, and link it to the alert using ANNOTATIONS #6 - Associate playbooks to alerts ALERT HIGH_ERROR_RATE_ON_FRONTEND ON ... > 0.01 LABELS { team="frontend", severity="warning" } ANNOTATIONS { playbook="https://doc.spreaker.com/playbooks/high-error-rate-on-frontend" } Prometheus v. 1 syntax
  • 36. Customize the alert messages, displaying the playbook too. #6 - Associate playbooks to alerts
  • 37. Both labels and annotations allow you to attach metadata to your alerts. #7 - Labels and Annotations ALERT HIGH_ERROR_RATE_ON_FRONTEND ON ... > 0.01 LABELS { team="frontend", severity="warning" } ANNOTATIONS { playbook="https://..." } Prometheus v. 1 syntax LABELS - Information to identify an alert - Read by a machine ANNOTATIONS - Extra information for the receiver (e.g. description) - Read by a human
  • 38. To recap 1. Keep it simple 2. Focus on metrics that bring you value 3. Ensure each alert is actionable 4. Write playbooks for manual intervention 5. Do not alert at all if you can automate the resolution
  • 39. Thanks Questions? Marco Pracucci If you liked it, follow me on Twitter: @pracucci