Note that the provided environments will not be available outside the workshop; to run the environment yourself, follow the instructions at https://github.com/PierreVincent/prometheus-workshop.
In the world of cloud native and distributed applications, Prometheus has quickly risen to be one of the leading open-source monitoring tools. This workshop covers what you need to get started with Prometheus for monitoring a service-oriented architecture.
We will cover:
- The core concepts of Prometheus
- Instrumenting your code to expose metrics
- Querying Prometheus to gain insights on how your applications behave
- Defining rules to trigger alerts based on metrics and thresholds
- Building Grafana dashboards combining multiple metrics
8. @PierreVincent
Scraping for samples
Users send requests to the Service, which exposes its metrics at /metrics:
# HELP http_requests_total Total number of http requests by response status code
# TYPE http_requests_total counter
http_requests_total{endpoint="/login",status="200"} 1584
http_requests_total{endpoint="/login",status="500"} 9
...
Every scrape interval, each value results in a sample, which is persisted in the tsdb:
metric: http_requests_total
labels: endpoint=/login, status=200
timestamp: 1519205931
value: 1584
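To make the scrape-to-sample step concrete, here is a minimal Python sketch that turns the exposition text from the slide into samples. It is an illustration only, not how Prometheus itself parses scrapes: the names `Sample` and `parse_scrape` are hypothetical, the parser ignores escaped quotes and optional per-line timestamps, and the scrape time is simply passed in by the caller.

```python
import re
from typing import NamedTuple


class Sample(NamedTuple):
    """One stored data point: metric name, labels, timestamp and value."""
    metric: str
    labels: dict
    timestamp: int
    value: float


# Simplified patterns (assumption: no escaped quotes, no trailing timestamp).
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_scrape(text, scrape_time):
    """Return one Sample per value line, skipping # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # not a "name{labels} value" line; ignored in this sketch
        labels = {
            m.group("key"): m.group("val")
            for m in _LABEL.finditer(match.group("labels") or "")
        }
        samples.append(
            Sample(match.group("name"), labels, scrape_time, float(match.group("value")))
        )
    return samples


EXPOSITION = """\
# HELP http_requests_total Total number of http requests by response status code
# TYPE http_requests_total counter
http_requests_total{endpoint="/login",status="200"} 1584
http_requests_total{endpoint="/login",status="500"} 9
"""

samples = parse_scrape(EXPOSITION, scrape_time=1519205931)
for sample in samples:
    print(sample)
```

Each value line yields exactly one sample per scrape, so scraping the same endpoint again one interval later would append two more samples with a later timestamp.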
Exercises 1 - Counters & Rates
● What's the overall request rate (with a 1 minute rolling-window) for the http-simulator service?
  sum(rate(http_requests_total{app="http-simulator"}[1m]))
● How many requests per minute are errors?
  60 * sum(rate(http_requests_total{app="http-simulator", status="500"}[1m]))
● What's the error rate (in %) of requests to the /users endpoint?
  100 * sum(rate(http_requests_total{app="http-simulator", endpoint="/users", status="500"}[1m]))
    / sum(rate(http_requests_total{app="http-simulator", endpoint="/users"}[1m]))
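The arithmetic behind these three queries can be sketched in a few lines of Python. This is a deliberately simplified model: the real `rate()` also handles counter resets and extrapolates to the window boundaries, which the hypothetical `per_second_rate` helper below does not. The counter values are made up for illustration (only 1584 and 9 echo the earlier slide), with two scrapes taken 60 seconds apart.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples for one endpoint, keyed by status code:
# (timestamp in seconds, counter value) at the start and end of a 1m window.
window = {
    "200": ((1519205871, 1464), (1519205931, 1584)),
    "500": ((1519205871, 3), (1519205931, 9)),
}

rates = {status: per_second_rate(*pair) for status, pair in window.items()}

# sum(rate(...[1m])): overall requests per second across all status codes
overall_rate = sum(rates.values())

# 60 * sum(rate(...{status="500"}[1m])): errors per minute
errors_per_minute = 60 * rates["500"]

# 100 * errors / total: error rate in percent
error_percent = 100 * rates["500"] / overall_rate

print(f"overall: {overall_rate:.2f} req/s")
print(f"errors: {errors_per_minute:.1f} per minute")
print(f"error rate: {error_percent:.2f} %")
```

Because `rate()` returns a per-second value, the multiplication by 60 is what converts it into a per-minute figure, and the percentage is a ratio of two rates over the same window.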
Exercises 2 - Latency distribution
● What is the median latency of all requests to the http-simulator service?
  histogram_quantile(0.5, sum by (le) (rate(http_request_duration_milliseconds_bucket{app="http-simulator"}[5m])))
● Does the /users endpoint fulfill the SLO of three nines (99.9%) of requests responding within 400ms?
  sum(http_request_duration_milliseconds_bucket{app="http-simulator", status="200", endpoint="/users", le="400"})
    / sum(http_request_duration_milliseconds_count{app="http-simulator", status="200", endpoint="/users"})
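Both questions rest on cumulative histogram buckets, where each `le` bucket counts every request at or below that bound. The Python sketch below shows the idea under stated assumptions: the bucket counts are invented, the function names are hypothetical, and the quantile estimate follows the linear-interpolation approach that `histogram_quantile` is documented to use, without claiming to match Prometheus in every edge case.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    Observations are assumed to be spread evenly inside each bucket,
    and the first bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def fraction_within(buckets, le):
    """Share of observations at or below the bucket bound `le`."""
    counts = dict(buckets)
    return counts[le] / buckets[-1][1]


# Hypothetical latency buckets in milliseconds: (le, cumulative count).
BUCKETS = [(100.0, 400), (200.0, 800), (400.0, 990), (800.0, 999), (math.inf, 1000)]

median_ms = histogram_quantile(0.5, BUCKETS)
within_400ms = fraction_within(BUCKETS, 400.0)
meets_slo = within_400ms >= 0.999

print(f"median latency: {median_ms} ms")
print(f"within 400ms: {within_400ms:.1%}, SLO met: {meets_slo}")
```

The SLO check only works because 400 is an actual bucket boundary; a threshold that falls between two `le` values cannot be answered exactly from the histogram.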
Exercises 3 - Grafana widgets
Some examples of widgets (or come up with your own ones):
● Graph of latency distribution
● Cumulative % graph of endpoint request rate
● Memory usage over time
● CPU usage over time
● Graph of the % of requests fulfilling the 400ms SLO for the /login endpoint
● ...