SlideShare a Scribd company logo
1 of 46
Systems Monitoring with Prometheus
Devops Ireland, April 2015
Brian Brazil
Senior Software Engineer
Boxever
What is monitoring?
What is monitoring?
Host-based checks?
• Typically shell scripts with success/fail
• Failure causes alerts
• More blackbox than whitebox
• About machines, not services
Brian’s Pet Peeve #1
Thinking in terms of machines rather than services.
It’s the future, it’s not the “Webserver machine” it’s one
machine what happens to run a Webserver as part of the
Webserver service.
What is monitoring?
Highly granular information about a subsystem?
• Tends to be focused on one subsystem, such as
incoming http requests
• No visibility into the rest of the system
What is monitoring?
High frequency high granularity profiling?
• Great for debugging once you’ve narrowed down the
problem
• Not so useful for general monitoring
What is monitoring?
Tailing logs?
• Easy to miss something
• Tend to get very noisy
• We have computers, why are humans doing repetitive
tasks?
Step Back
Why do we want monitoring?
Why: Alerting
We want to know when things go wrong
We want to know when things aren’t quite right
We want to know in advance of problems
Why: Debugging
When something is up, you need to debug.
You want to go from high-level problem, and drill down to
what’s causing it. Need to be able to reason about things.
Sometimes want to go from code back to metrics.
Brian’s Pet Peeve #2
Instrumentation that you need to read the code to
understand.
Make the names such that a random person not intimately
familiar with the system would have a good chance at
guessing what it means. Specify your units.
Why: Trending
How the various bits of a system are being used.
For example, how many static requests per dynamic
request? How many sessions active at once? How many hit
a certain corner case?
For some stats, also want to know how they change over
time for capacity planning and design discussions.
A different approach
What if we instrumented everything?
• RPCs
• Interfaces between subsystems
• Business logic
• Every time you’d log something
What if we monitored systems and subsystems to know
how everything is generally doing?
That’s a lot of metrics
That could be tens of thousands of codepoints across an
entire system.
You’d need some way to make it easy to instrument all
code, not just the externally facing parts of applications.
You’d need something able to handle a million time series.
Presenting Prometheus
An open-source service monitoring system and time series
database.
Started in 2012, primarily developed in Soundcloud with
committers also in Boxever and Docker.
Publicly announced January 2015, many contributions and
users since then.
Architecture
The Server
• Can handle over a million time series in one instance
• No dependencies such as HBase or Cassandra
• Stores data on local disk
• Written in Go
• Easy to run
Data Model
Tired of aggregating and alerting off metrics like http.
responses.500.myserver.mydc.production?
Time series have structured key-value pairs, e.g.
http_responses_total{
response_code=”500”,instance=”myserver”,
dc=”mydc”,env=”production”}
Brian’s Pet Peeve #3
Munging structured data in a way that loses the structure
Is it so much to ask for some escaping, or at least sanitizing
any separators in the data?
Query Language
Aggregation based on the key-value labels
Arbitrarily complex math
And all of this can be used in pre-computed rules and alerts
Query Language: Example
Column families with the 10 highest read rates per second
topk(10,
sum by(job, keyspace, columnfamily) (
rate(cassandra_columnfamily_readlatency[5m])
)
)
Client Libraries
How you instrument your code
• Official: Python, Java, Go, Ruby
• Unofficial: Bash, NodeJS, .Net
Text-based format, easy to write new client libraries
Client Libraries: Example
@Summary(“request_latency_seconds”,
“Request latency”).time()
def process_request():
pass
Client Libraries: In and Out
Client libraries don’t tie you to Prometheus instrumentation
Custom collectors allow pulling data from other
instrumentation systems into Prometheus client library
Similarly, can pull data out of client library and expose as
you wish
Client Libraries: What to Instrument
Client Libraries: What to Instrument
Everything
Things to have
• Client and server qps/errors/latency
• Every log message should be a metric
• Every failure should be a metric
• Threadpool/queue size, in progress, latency
• Business logic inputs and outputs
• Data sizes in/out
• Process cpu/ram/language internals (e.g. GC)
• Blackbox and end-to-end monitoring
Batch/Offline Processing Metrics
• Last time it succeeded
• Records processed/throughput
• Duration of major batch stages
• Heartbeats for end-to-end testing
Use the PushGateway for ephemeral jobs
Brian’s Pet Peeve #4
Wrapping instrumentation libraries to make them “simpler”
Tend to confuse abstractions, encourage bad practices and
make it difficult to write correct and useable instrumentation
e.g. Prometheus values are doubles, if you only allow ints
then end user has to do math to convert back to seconds
Speaking of Correct Instrumentation
It’s better to have math done in the server, not the client
Many instrumentation systems are exponentially decaying
Do you really want to do calculus during an outage?
Prometheus has monotonic counters
Races and missed scrapers don’t lose data
Integrations
More powerful data model needs integrations and
instrumentation written to take advantage of that
Machine (Node Exporter), HAProxy, CloudWatch, Statsd,
Collectd, JMX, Mesos, Consul, MySQL
Direct instrumentation: cadvisor, etcd
Dashboards
• Promdash: Ruby on Rails web app
• Console templates: More power for those who like
checking things in
• Expression browser: Ad-hoc queries
• JSON interface: Roll your own
Dashboards
Dashboards
What to put on dashboards?
Dashboards
Goal: Make it easy to logically drill down a problem
Most services are a graph.
Make it easy to go from a console about one service, see
which of it’s backends is the problem, and repeat.
Brian’s Pet Peeve #5
Dashboard anti-patterns:
● Graph of a hundred plots
● Page of a thousand graphs
● Consoles that their creator can barely understand
● “Put it on a console somewhere”
Dashboard Guidelines
• Don’t put every possible metric on the dashboard
• Focus on the top few metrics, based on the most likely
failure modes and things you’ll want to know
• No more than 5 graphs per console, 5 plots per graph
• Have units, y-labels, legends and descriptions
• Split out or trim if dashboards are getting too complex
• It’s difficult for a dashboard to serve two masters
• Dashboards are not for alerting
Alerting
Alertmanager aggregates alerts from Prometheus servers
Supports notifications to Pagerduty, Email, Pushover
Best practices:
• Alert on symptoms not causes
• Have a way to deal with non-critical alerts
What to Alert On
Online Serving: Overall latency, errors
Offline Processing: Propagation/Processing Delay
Batch Jobs: When it last Suceeded
The Live Demo
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work
The Future
Many features on roadmap:
• Service discovery
• Federation
• Long term storage
• HA Alertmanager
• More exporters, client libraries and integrations
Final Word
Systems Monitoring is your first port of call in an
emergency, keep it working without needing lots of effort.
Prometheus is awesome - can be lead to non-critical data
taking lots of management, crowding out critical metrics.
At some point, have to move non-critical things to generic
data processing system.
Finaler Word
How do you know your monitoring system is good?
• When you have superb monitoring for everything?
• When it causes a HDD to fail?
• When it finds two bugs in Go’s DNS library?
Finaler Word
How do you know your monitoring system is good?
• When you have superb monitoring for everything?
• When it causes a HDD to fail?
• When it finds two bugs in Go’s DNS library?
No, when the unittests find a bug in your filesystem!
Try it out!
http://www.boxever.com/tag/monitoring has step-by-step instructions on
monitoring:
• Machines
• Cassandra
• HAProxy
• Python Batch host-based jobs
• Java applications
Problems?
We’re on #prometheus on Freenode
More Information
http://prometheus.io
http://www.boxever.com/tag/monitoring
Python Ireland, April 8th
SREcon15 Europe, May 14-15th

More Related Content

What's hot

An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheusKasper Nissen
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to PrometheusJulien Pivotto
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaSyah Dwi Prihatmoko
 
Introduction to Prometheus and Cortex (WOUG)
Introduction to Prometheus and Cortex (WOUG)Introduction to Prometheus and Cortex (WOUG)
Introduction to Prometheus and Cortex (WOUG)Weaveworks
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenParis Container Day
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basicsJuraj Hantak
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with PrometheusQAware GmbH
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?Wojciech Barczyński
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)Lucas Jellema
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...LibbySchulze
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with PrometheusShiao-An Yuan
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Kubernetes Observability with Prometheus by Example
Kubernetes Observability with Prometheus by ExampleKubernetes Observability with Prometheus by Example
Kubernetes Observability with Prometheus by ExampleThomas Riley
 

What's hot (20)

Cloud Monitoring tool Grafana
Cloud Monitoring  tool Grafana Cloud Monitoring  tool Grafana
Cloud Monitoring tool Grafana
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and Grafana
 
Introduction to Prometheus and Cortex (WOUG)
Introduction to Prometheus and Cortex (WOUG)Introduction to Prometheus and Cortex (WOUG)
Introduction to Prometheus and Cortex (WOUG)
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
Prometheus and Grafana
Prometheus and GrafanaPrometheus and Grafana
Prometheus and Grafana
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Kubernetes Observability with Prometheus by Example
Kubernetes Observability with Prometheus by ExampleKubernetes Observability with Prometheus by Example
Kubernetes Observability with Prometheus by Example
 

Similar to Systems Monitoring with Prometheus (Devops Ireland April 2015)

Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systemsBill Buchan
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...eZ Systems
 
Chris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO ForumChris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO ForumChris Mathias
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation ProjectsAmazon Web Services
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Brian Brazil
 
Eric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler
 
Best Practices for WordPress in Enterprise
Best Practices for WordPress in EnterpriseBest Practices for WordPress in Enterprise
Best Practices for WordPress in EnterpriseTaylor Lovett
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
How Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your ProjectHow Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your ProjectWan Leung Wong
 
SharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonoughSharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonoughGabrijela Orsag
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 
Using Apache Camel as AKKA
Using Apache Camel as AKKAUsing Apache Camel as AKKA
Using Apache Camel as AKKAJohan Edstrom
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to MicroservicesAd van der Veer
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010Christopher Brown
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
 

Similar to Systems Monitoring with Prometheus (Devops Ireland April 2015) (20)

Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
The Business Value of PaaS Automation - Kieron Sambrook-Smith - Presentation ...
 
Chris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO ForumChris Mathias Presents Advanced API Design Considerations at LA CTO Forum
Chris Mathias Presents Advanced API Design Considerations at LA CTO Forum
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)
 
Eric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New ContextsEric Proegler Oredev Performance Testing in New Contexts
Eric Proegler Oredev Performance Testing in New Contexts
 
Best Practices for WordPress in Enterprise
Best Practices for WordPress in EnterpriseBest Practices for WordPress in Enterprise
Best Practices for WordPress in Enterprise
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Redundant devops
Redundant devopsRedundant devops
Redundant devops
 
How Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your ProjectHow Open Source / Open Technology Could Help On Your Project
How Open Source / Open Technology Could Help On Your Project
 
SharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonoughSharePoint Performance Monitoring with Sean P. McDonough
SharePoint Performance Monitoring with Sean P. McDonough
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Using Apache Camel as AKKA
Using Apache Camel as AKKAUsing Apache Camel as AKKA
Using Apache Camel as AKKA
 
soa
soasoa
soa
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 

More from Brian Brazil

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)Brian Brazil
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Brian Brazil
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Brian Brazil
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Brian Brazil
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Brian Brazil
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)Brian Brazil
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Brian Brazil
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Brian Brazil
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an ExporterBrian Brazil
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLBrian Brazil
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Brian Brazil
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 

More from Brian Brazil (20)

OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
 
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
Staleness and Isolation in Prometheus 2.0 (PromCon 2017)
 
Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)Rule 110 for Prometheus (PromCon 2017)
Rule 110 for Prometheus (PromCon 2017)
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
So You Want to Write an Exporter
So You Want to Write an ExporterSo You Want to Write an Exporter
So You Want to Write an Exporter
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 

Recently uploaded

ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxNIMMANAGANTI RAMAKRISHNA
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxAndrieCagasanAkio
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesLumiverse Solutions Pvt Ltd
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxMario
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxmibuzondetrabajo
 

Recently uploaded (9)

ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptx
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptx
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Cybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best PracticesCybersecurity Threats and Cybersecurity Best Practices
Cybersecurity Threats and Cybersecurity Best Practices
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptx
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptx
 

Systems Monitoring with Prometheus (Devops Ireland April 2015)

  • 1. Systems Monitoring with Prometheus Devops Ireland, April 2015 Brian Brazil Senior Software Engineer Boxever
  • 3. What is monitoring? Host-based checks? • Typically shell scripts with success/fail • Failure causes alerts • More blackbox than whitebox • About machines, not services
  • 4. Brian’s Pet Peeve #1 Thinking in terms of machines rather than services. It’s the future, it’s not the “Webserver machine” it’s one machine what happens to run a Webserver as part of the Webserver service.
  • 5. What is monitoring? Highly granular information about a subsystem? • Tends to be focused on one subsystem, such as incoming http requests • No visibility into the rest of the system
  • 6. What is monitoring? High frequency high granularity profiling? • Great for debugging once you’ve narrowed down the problem • Not so useful for general monitoring
  • 7. What is monitoring? Tailing logs? • Easy to miss something • Tend to get very noisy • We have computers, why are humans doing repetitive tasks?
  • 8. Step Back Why do we want monitoring?
  • 9. Why: Alerting We want to know when things go wrong We want to know when things aren’t quite right We want to know in advance of problems
  • 10. Why: Debugging When something is up, you need to debug. You want to go from high-level problem, and drill down to what’s causing it. Need to be able to reason about things. Sometimes want to go from code back to metrics.
  • 11. Brian’s Pet Peeve #2 Instrumentation that you need to read the code to understand. Make the names such that a random person not intimately familiar with the system would have a good chance at guessing what it means. Specify your units.
  • 12. Why: Trending How the various bits of a system are being used. For example, how many static requests per dynamic request? How many sessions active at once? How many hit a certain corner case? For some stats, also want to know how they change over time for capacity planning and design discussions.
  • 13. A different approach What if we instrumented everything? • RPCs • Interfaces between subsystems • Business logic • Every time you’d log something What if we monitored systems and subsystems to know how everything is generally doing?
  • 14. That’s a lot of metrics That could be tens of thousands of codepoints across an entire system. You’d need some way to make it easy to instrument all code, not just the externally facing parts of applications. You’d need something able to handle a million time series.
  • 15. Presenting Prometheus An open-source service monitoring system and time series database. Started in 2012, primarily developed in Soundcloud with committers also in Boxever and Docker. Publicly announced January 2015, many contributions and users since then.
  • 17. The Server • Can handle over a million time series in one instance • No dependencies such as HBase or Cassandra • Stores data on local disk • Written in Go • Easy to run
  • 18. Data Model Tired of aggregating and alerting off metrics like http. responses.500.myserver.mydc.production? Time series have structured key-value pairs, e.g. http_responses_total{ response_code=”500”,instance=”myserver”, dc=”mydc”,env=”production”}
  • 19. Brian’s Pet Peeve #3 Munging structured data in a way that loses the structure Is it so much to ask for some escaping, or at least sanitizing any separators in the data?
  • 20. Query Language Aggregation based on the key-value labels Arbitrarily complex math And all of this can be used in pre-computed rules and alerts
  • 21. Query Language: Example Column families with the 10 highest read rates per second topk(10, sum by(job, keyspace, columnfamily) ( rate(cassandra_columnfamily_readlatency[5m]) ) )
  • 22. Client Libraries How you instrument your code • Official: Python, Java, Go, Ruby • Unofficial: Bash, NodeJS, .Net Text-based format, easy to write new client libraries
  • 24. Client Libraries: In and Out Client libraries don’t tie you to Prometheus instrumentation Custom collectors allow pulling data from other instrumentation systems into Prometheus client library Similarly, can pull data out of client library and expose as you wish
  • 25. Client Libraries: What to Instrument
  • 26. Client Libraries: What to Instrument Everything
  • 27. Things to have • Client and server qps/errors/latency • Every log message should be a metric • Every failure should be a metric • Threadpool/queue size, in progress, latency • Business logic inputs and outputs • Data sizes in/out • Process cpu/ram/language internals (e.g. GC) • Blackbox and end-to-end monitoring
  • 28. Batch/Offline Processing Metrics • Last time it succeeded • Records processed/throughput • Duration of major batch stages • Heartbeats for end-to-end testing Use the PushGateway for ephemeral jobs
  • 29. Brian’s Pet Peeve #4 Wrapping instrumentation libraries to make them “simpler” Tend to confuse abstractions, encourage bad practices and make it difficult to write correct and useable instrumentation e.g. Prometheus values are doubles, if you only allow ints then end user has to do math to convert back to seconds
  • 30. Speaking of Correct Instrumentation It’s better to have math done in the server, not the client Many instrumentation systems are exponentially decaying Do you really want to do calculus during an outage? Prometheus has monotonic counters Races and missed scrapers don’t lose data
  • 31. Integrations More powerful data model needs integrations and instrumentation written to take advantage of that Machine (Node Exporter), HAProxy, CloudWatch, Statsd, Collectd, JMX, Mesos, Consul, MySQL Direct instrumentation: cadvisor, etcd
  • 32. Dashboards • Promdash: Ruby on Rails web app • Console templates: More power for those who like checking things in • Expression browser: Ad-hoc queries • JSON interface: Roll your own
  • 34. Dashboards What to put on dashboards?
  • 35. Dashboards Goal: Make it easy to logically drill down a problem Most services are a graph. Make it easy to go from a console about one service, see which of it’s backends is the problem, and repeat.
  • 36. Brian’s Pet Peeve #5 Dashboard anti-patterns: ● Graph of a hundred plots ● Page of a thousand graphs ● Consoles that their creator can barely understand ● “Put it on a console somewhere”
  • 37. Dashboard Guidelines • Don’t put every possible metric on the dashboard • Focus on the top few metrics, based on the most likely failure modes and things you’ll want to know • No more than 5 graphs per console, 5 plots per graph • Have units, y-labels, legends and descriptions • Split out or trim if dashboards are getting too complex • It’s difficult for a dashboard to serve two masters • Dashboards are not for alerting
  • 38. Alerting Alertmanager aggregates alerts from Prometheus servers Supports notifications to Pagerduty, Email, Pushover Best practices: • Alert on symptoms not causes • Have a way to deal with non-critical alerts
  • 39. What to Alert On Online Serving: Overall latency, errors Offline Processing: Propagation/Processing Delay Batch Jobs: When it last Suceeded
  • 40. The Live Demo please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work please work
  • 41. The Future Many features on roadmap: • Service discovery • Federation • Long term storage • HA Alertmanager • More exporters, client libraries and integrations
  • 42. Final Word Systems Monitoring is your first port of call in an emergency, keep it working without needing lots of effort. Prometheus is awesome - can be lead to non-critical data taking lots of management, crowding out critical metrics. At some point, have to move non-critical things to generic data processing system.
  • 43. Finaler Word How do you know your monitoring system is good? • When you have superb monitoring for everything? • When it causes a HDD to fail? • When it finds two bugs in Go’s DNS library?
  • 44. Finaler Word How do you know your monitoring system is good? • When you have superb monitoring for everything? • When it causes a HDD to fail? • When it finds two bugs in Go’s DNS library? No, when the unittests find a bug in your filesystem!
  • 45. Try it out! http://www.boxever.com/tag/monitoring has step-by-step instructions on monitoring: • Machines • Cassandra • HAProxy • Python Batch host-based jobs • Java applications Problems? We’re on #prometheus on Freenode