4. Why do I need monitoring at all?
• Know when things break (and act on it)
• Understand the performance characteristics of your applications
• Meet service level objectives
• Improve performance and reliability
5. Fundamentals: Push vs Pull
[Diagram: push model — the Metrics Agent next to the HTTP Server pushes events (success_event, error_event, …) to the Monitoring Service]
[Diagram: pull model — the Monitoring Service issues GET /metrics against the Metrics Agent and receives req_total, req_latency]
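The pull model boils down to the service exposing a /metrics endpoint that the monitoring server scrapes periodically. A stdlib-only Python sketch (metric names taken from the slide; values are illustrative):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory metric values the endpoint exposes; a real agent would
# collect these from the instrumented application.
METRICS = {"req_total": 42, "req_latency": 0.135}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: one "name value" line per sample.
        body = "".join(f"{name} {value}\n" for name, value in METRICS.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port=0):
    """Start the metrics endpoint in a background thread; return the bound port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_port
```

The monitoring service then simply runs GET /metrics against this port on its scrape interval; nothing is pushed.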
6. Fundamentals: Blackbox vs. Whitebox Monitoring
Blackbox Monitoring
• restricted to external service behaviour
• e.g. ping, http
Whitebox Monitoring
• shows internal service information
• e.g. request_processing_time, request_errors_total
[Diagram: blackbox — the Agent issues HTTP GET / against the HTTP Server and only sees the 200 OK response; whitebox — the Agent scrapes GET /metrics and receives errors_total, req_total, req_latency]
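The distinction can be sketched in a few lines of Python (the function and metric names here are illustrative, not from any real library):

```python
# Blackbox: only external behaviour is observable (did GET / return 200 OK?).
# `fetch` is injected so the probe can be exercised without a live network;
# in practice it would wrap e.g. urllib.request.urlopen.
def blackbox_probe(fetch, url):
    try:
        return fetch(url) == 200
    except OSError:
        return False

# Whitebox: the service itself exposes internal counters such as
# request_errors_total, typically via its /metrics endpoint.
internal_metrics = {"request_errors_total": 0, "req_total": 0}

def handle_request(ok=True):
    """Instrumented request handler updating the internal (whitebox) metrics."""
    internal_metrics["req_total"] += 1
    if not ok:
        internal_metrics["request_errors_total"] += 1
```

A blackbox probe can tell you the service is down; only the whitebox counters can tell you why requests are failing.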
7. Prometheus Project Overview
• Open-source monitoring tool, originally built at SoundCloud
• Heavily inspired by Google’s Borgmon
• Written in Go (mainly)
• Very active community (150+ contributors for core Prometheus)
• Member of the Cloud Native Computing Foundation
11. Service Discovery
• Could live without SD in static environments
• In dynamic environments (e.g. Kubernetes) you must use SD
• Pods, Services, … come and go → impossible to configure statically
[Diagram: Prometheus discovers scrape targets by querying the API Server for Pod 1, Pod 2, …, Pod n]
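Conceptually, service discovery just rebuilds the scrape-target list from whatever the cluster currently reports. A hedged Python sketch (`list_pods`, the tuple shape, and the metrics port are all illustrative assumptions, not a real Kubernetes API):

```python
def discover_targets(list_pods, port=8080):
    """Rebuild the scrape-target list from the pods that exist right now.
    `list_pods` stands in for a list/watch call against the cluster API
    server and returns (pod_name, pod_ip) pairs."""
    return {name: f"{ip}:{port}" for name, ip in list_pods()}
```

Re-running this on every discovery refresh means targets appear and disappear together with their pods, with no config edits.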
12. Jobs, Targets and Exporters
source: https://prometheus.io/docs/introduction/overview/
19. Timeseries
• Tracks values of a metric over time (timestamp t, value v)
• Timestamps increase (strictly) monotonically
• tₙ < tₙ₊₁ ∀ n ∈ ℕ
• Values can both increase or decrease
[Diagram: samples (t1,v1), (t2,v2), (t3,v3) plotted along a time axis]
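The two invariants above (strictly increasing timestamps, freely moving values) fit in a few lines; a minimal sketch, assuming a timeseries is just a list of (t, v) tuples:

```python
def append_sample(series, t, v):
    """Append a (timestamp, value) sample, enforcing that timestamps
    increase strictly monotonically; values may go up or down freely."""
    if series and t <= series[-1][0]:
        raise ValueError("timestamps must be strictly increasing")
    series.append((t, v))
```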
20. Metric Types
• Only relevant for Client libs (only untyped timeseries on the server)
• Counters: Values always increase
• requests_total, errors_total
• Gauges: Values can increase and decrease
• users_online, memory_free_bytes
• Histogram: Puts your measurements in buckets
• requests_latency_seconds_bucket
• Summary: Calculates percentiles over sliding time window
• requests_latency_seconds_summary
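The semantics of these types can be sketched stdlib-only (a simplified illustration, not the real client-library API; the Summary type's sliding-window percentiles are omitted for brevity):

```python
class Counter:
    """Values only ever increase (e.g. requests_total, errors_total)."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

class Gauge:
    """Values can increase and decrease (e.g. users_online, memory_free_bytes)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Puts observations into buckets (e.g. requests_latency_seconds_bucket)."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.buckets = sorted(buckets)
        self.counts = [0] * len(self.buckets)
        self.total = 0.0
    def observe(self, value):
        self.total += value
        # Prometheus buckets are cumulative: each le="x" bucket counts
        # every observation <= x.
        for i, le in enumerate(self.buckets):
            if value <= le:
                self.counts[i] += 1
```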
21. Labels
• A list of key/value pairs (k₁ = v₁, k₂ = v₂, …, kₙ = vₙ)
• Labels partition a metric into timeseries
• So for every label combination observed on a given metric, a separate
timeseries is created
req_total{job="job1", ver="0.1"}: 10 → timeseries_1
req_total{job="job1", ver="0.2"}: 3 → timeseries_2
req_total{job="job2", ver="0.1"}: 4 → timeseries_3
req_total{job="job1", ver="0.1"}: 12 → timeseries_1
req_total{job="job1"}: 1 → timeseries_4
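The partitioning can be sketched by keying storage on the frozen label set (a minimal Python sketch reproducing the slide's example):

```python
series = {}  # (metric, frozenset of label pairs) -> list of sample values

def record(metric, labels, value):
    """Each distinct label combination keys its own timeseries."""
    key = (metric, frozenset(labels.items()))
    series.setdefault(key, []).append(value)

record("req_total", {"job": "job1", "ver": "0.1"}, 10)
record("req_total", {"job": "job1", "ver": "0.2"}, 3)
record("req_total", {"job": "job2", "ver": "0.1"}, 4)
record("req_total", {"job": "job1", "ver": "0.1"}, 12)  # same labels -> same series
record("req_total", {"job": "job1"}, 1)                 # fewer labels -> new series
```

Five writes, but only four timeseries: the two samples with identical labels land in the same one.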
23. PromQL
• Query language to select and aggregate timeseries
• Queries evaluate to either
• Instant Vector: multiple timeseries, each with one sample at the same timestamp
• requests_total{job="prom-boot"}
• Range Vector: multiple timeseries, each with a range of samples over time
• requests_total{job="prom-boot"}[5m]
• Scalar: a single numeric value
• scalar(sum(requests_total))
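PromQL's rate() consumes a range vector of counter samples and produces a per-second increase. A deliberately simplified Python sketch of that computation (the real implementation also extrapolates to the window boundaries and handles counter resets):

```python
def rate(samples, now, window):
    """Per-second increase of a counter over the samples falling inside
    [now - window, now]; `samples` is a list of (timestamp, value) tuples."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0  # not enough points to compute a slope
    (t0, v0), (tn, vn) = in_window[0], in_window[-1]
    return (vn - v0) / (tn - t0)
```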
24. Alerting Rules
ALERT ErrorRateHigh
IF rate(errors_total[5m]) > 10
LABELS {
service = "service_1",
severity = "warning"
}
ANNOTATIONS {
summary = "A high number of errors in service",
description = "{{ $value }} errors/s have been registered
within the last 5 minutes for {{ $labels.instance }}"
}
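What the rule evaluation amounts to can be sketched in Python (Prometheus itself evaluates rules server-side; this standalone function and its return shape are illustrative only):

```python
def evaluate_alert(error_rate, threshold=10.0):
    """Sketch of the ErrorRateHigh rule above: fire when the 5-minute
    error rate exceeds the threshold, attaching labels and annotations."""
    if error_rate <= threshold:
        return None  # condition not met -> alert stays inactive
    return {
        "alert": "ErrorRateHigh",
        "labels": {"service": "service_1", "severity": "warning"},
        "annotations": {
            "summary": "A high number of errors in service",
            "description": f"{error_rate} errors/s have been registered "
                           "within the last 5 minutes",
        },
    }
```

A firing alert carries its labels and rendered annotations on to the Alertmanager for routing and notification.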