Monitoring in a fast-changing world with Prometheus

Julien Pivotto
@roidelapluie
October 2021

• Applications are short lived
• Updated often
• Infrastructure changes
(Nothing new...)
A fast changing world

• Monitoring an infrastructure
• Monitoring user experience
... together (dev&ops)
Monitoring in a "fast changing world"

CPU Usage, Disk space, Memory, Open file descriptors, ...
Infrastructure monitoring

RED: Request Rate, Request Errors, Request Duration
USE: Utilization, Saturation, Errors
User experience monitoring
RED method by Tom Wilkie, USE method by Brendan Gregg

• High level overview of the state of a service/component
• Performance
• Availability
• Technical components
What is going on?
What is monitoring?

• Understand how your services behave
• As if you were in their place
• Without specific code
Why is this going on?
What's observability?

• Monitoring is required
• Some monitoring systems are designed for observability
• If lucky, monitoring is enough
• Observability is removing luck
How do monitoring and observability connect?

Three pillars:
• Metrics
• Logs
• Traces
What's observability - in Practice?

Metrics
https://play.grafana.org/

Traces
https://www.jaegertracing.io/img/trace-detail-ss.png

• Culture: building together
• Automation
• Measurement
• Sharing
A DevOps world
John Willis and Damon Edwards, 2010

• Open Source monitoring solution
• Graduated CNCF Project
• Born in 2012, publicly announced in 2015
• Collects metrics
• Plenty of integrations
• Service discovery, e.g. for Kubernetes
• Easy to use query language
• Built-in alerting
Prometheus

• A community
• A server and many other components
• An ecosystem
What "Prometheus" means

• Open Source
• Pull-based Monitoring over HTTP
• Powerful query language
• Optimized TSDB
The Prometheus server

How does it work?

Prometheus scrapes metrics over HTTP.
caddy_http_requests_total{code="200",method="POST",path="/load"} 1
Dimensional data model, for filtering and aggregation.
Metrics and Labels
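To make the dimensional data model concrete, here is a minimal Python sketch that splits the sample line above into a metric name, a label set, and a value. The function name `parse_sample` is hypothetical, and the parser is deliberately simplified: it assumes no escaped quotes, timestamps, or comment lines.

```python
import re

# Simplified grammar: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric name, labels dict, float value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'caddy_http_requests_total{code="200",method="POST",path="/load"} 1'
)
```

Each label is one dimension: filtering picks label values, aggregation drops labels.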

sum(rate(caddy_http_requests_total{code=~"5.."}[5m]))
Gets the rate of all 5xx HTTP responses (server errors).
Querying metrics and Labels

sum(rate(caddy_http_requests_total{code=~"5.."}[5m])) without(code)
/
sum(rate(caddy_http_requests_total[5m])) without(code)
Gets the ratio of 5xx HTTP responses (server errors) to all responses; multiply by 100 for a percentage.
Querying metrics and Labels
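The arithmetic behind the ratio query can be sketched in Python. This is an illustration with made-up sample data, not how Prometheus implements `rate()`: the real function also handles counter resets and extrapolates to the window boundaries. `simple_rate` and `error_ratio` are hypothetical names.

```python
import re


def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def error_ratio(window, pattern="5.."):
    """Share of the request rate whose status code fully matches pattern."""
    rates = {code: simple_rate(samples) for code, samples in window.items()}
    errors = sum(r for code, r in rates.items() if re.fullmatch(pattern, code))
    return errors / sum(rates.values())


# Made-up counter values over a 5 minute (300 s) window, keyed by status code.
window = {
    "200": [(0, 1000.0), (300, 1570.0)],
    "500": [(0, 10.0), (300, 25.0)],
    "503": [(0, 4.0), (300, 19.0)],
}
ratio = error_ratio(window)
```

`re.fullmatch` is used because PromQL label matchers are anchored on both ends, so `5..` matches `500` but not `5000`.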

• Metrics do not represent problems
• Metrics represent a state, give insights
• Metrics can be graphed
• You can alert based on them
Metrics and monitoring

In general, you can just expose counters and let the monitoring server do the
real maths.
That keeps the overhead on applications very low.
Exposed metrics are "raw"
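As an illustration of how little the application has to do, the sketch below keeps raw counters in memory and renders them in the text format shown earlier. The class `RequestCounter` and the metric name `app_requests_total` are hypothetical; a real application would normally use an official client library, and this sketch is not thread-safe.

```python
from collections import Counter


class RequestCounter:
    """A labelled counter that only ever goes up; no rates computed here."""

    def __init__(self, name):
        self.name = name
        self._values = Counter()

    def inc(self, **labels):
        # One time series per distinct label set.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self):
        """Render all series as text exposition lines."""
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_text}}}" if key else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


requests = RequestCounter("app_requests_total")
requests.inc(code="200", method="GET")
requests.inc(code="200", method="GET")
requests.inc(code="500", method="GET")
exposition = requests.render()
```

The application only increments; rates, ratios, and aggregations are left to queries on the server.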

• Prometheus server
• Alertmanager
• Exporters
Prometheus components

• Single binary
• No clustering
• No dependency on distributed FS
Prometheus server

• Single binary
• Clustering (gossip protocol)
Alertmanager

Let's see what makes Prometheus play nicely with automation tools.
Automating Prometheus

• Works on your machine
• Container ready
• Not tied to Kubernetes (on Kubernetes, see prometheus-operator)
Deploy anywhere

• Reloads on SIGHUP
• /-/reload endpoint (--web.enable-lifecycle)
• Ongoing work to reduce the overhead of reloads
Reloading Prometheus
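From an automation tool, the reload endpoint can be called with a plain HTTP POST. The Python sketch below assumes Prometheus listens on `http://localhost:9090` and was started with `--web.enable-lifecycle`; the function names are hypothetical.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10):
    """Trigger a reload; returns True on HTTP 200.

    urlopen raises urllib.error.HTTPError on error statuses, for example
    when the new configuration is rejected.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

Sending SIGHUP to the process (for example with `os.kill(pid, signal.SIGHUP)`) is the alternative when the lifecycle endpoint is not enabled.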

- template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml
    validate: /usr/bin/promtool check config %s
Also: check rules, check web-config.
Ansible
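The idea behind `validate` is to check a candidate file before it replaces the live one. A rough Python sketch of that pattern follows; it is an illustration of the principle, not Ansible's implementation. `install_validated` is a hypothetical helper, and handling of file mode and ownership is omitted.

```python
import os
import shlex
import subprocess
import tempfile


def install_validated(content, dest, validate_cmd):
    """Write content to a temp file, validate it, then replace dest.

    validate_cmd is a command line in which %s stands for the candidate
    file, e.g. "/usr/bin/promtool check config %s".
    Returns True if the file was installed, False if validation failed.
    """
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        cmd = [part.replace("%s", tmp_path) for part in shlex.split(validate_cmd)]
        if subprocess.run(cmd).returncode != 0:
            return False  # keep the current file untouched
        os.replace(tmp_path, dest)  # atomic on the same filesystem
        return True
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A broken configuration therefore never reaches the path Prometheus reads, so a later reload cannot pick it up.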

Plenty of situations do not require a reload of Prometheus:
• Password files
• TLS certificates
Prometheus will read them before use, no reload needed!
Not reloading Prometheus
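Why no reload is needed can be shown with a small Python sketch of the "read before use" pattern: the consumer stores only the path and reads the file each time it needs the secret. This illustrates the principle and is not Prometheus code; `FileBackedSecret` is a hypothetical name.

```python
import os
import tempfile


class FileBackedSecret:
    """Holds a path, not a value: the file is read again on every use."""

    def __init__(self, path):
        self.path = path

    def get(self):
        with open(self.path, encoding="utf-8") as fh:
            return fh.read().strip()


# Demonstration with a throwaway file: rotate the secret between two uses.
with tempfile.TemporaryDirectory() as directory:
    token_path = os.path.join(directory, "token")
    with open(token_path, "w", encoding="utf-8") as fh:
        fh.write("old-token\n")
    secret = FileBackedSecret(token_path)
    first = secret.get()
    with open(token_path, "w", encoding="utf-8") as fh:
        fh.write("new-token\n")
    second = secret.get()
```

Whatever rotates the file (a secrets agent, a certificate renewal job) does not need to notify the consumer.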

HashiCorp Vault enables retrieving temporary secrets and writing them to a file.
./vault agent -config vault-agent.hcl
Using Vault
https://inuits.eu/blog/prometheus-consul-vault-228/

scrape_configs:
  - job_name: consul-services
    consul_sd_configs:
      - server: localhost:8500
        authorization:
          credentials_file: consul_token
Reading token from Vault

Prometheus offers native TLS and basic auth.
tls_server_config:
  cert_file: server.crt
  key_file: server.key
basic_auth_users:
  alice: $2y$10$mDwo.lAisC94iLAyP8
  bob: $2y$10$hLqFl9jSjoAAy95Z/zw8
TLS and basic auth

The "web-config" file is read on every request:
• No need to reload
• Instantly change passwords, cert files
Shared config format between Prometheus and exporters!
TLS and basic auth

Prometheus has a snapshot API.
Enable with --web.enable-admin-api
curl -d{} http://localhost:9090/api/v1/admin/tsdb/snapshot
Prometheus TSDB is made of immutable blocks. Snapshots use hard links.
Backups
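A backup job can call the snapshot API and then copy the resulting directory. The Python sketch below builds the request and derives the snapshot location from the response. The helper names and the data directory `/var/lib/prometheus` are assumptions; the response shape and the `snapshots/` subdirectory follow the Prometheus documentation for this endpoint.

```python
import json
import os
import urllib.request


def build_snapshot_request(base_url="http://localhost:9090"):
    """Build the POST request for the TSDB snapshot admin endpoint."""
    url = base_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot"
    return urllib.request.Request(url, data=b"", method="POST")


def snapshot_dir(response_body, data_dir):
    """Return the directory of the snapshot named in the API response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return os.path.join(data_dir, "snapshots", payload["data"]["name"])


# Example response body, in the shape the API returns on success.
example = '{"status":"success","data":{"name":"20171210T211224Z-2be650b6d019eb54"}}'
path = snapshot_dir(example, "/var/lib/prometheus")
```

Because snapshots are hard links to immutable blocks, the directory can be copied elsewhere while Prometheus keeps running.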

Service Discovery

• Easier to know what's down with Pull
• Easy debugging (curl)
• Easier to spread the load
• Central configuration point
• High Availability
Prometheus pull model

• Prometheus must know what to pull
• Source of Truth
• Service Discovery != Auto discovery
• Event based when possible
Service Discovery

• Kubernetes
• Consul
• Cloud providers (Azure, AWS, GCP, DigitalOcean, Hetzner, Scaleway, Linode)
• Docker & Docker Swarm
• And more! 20+ external SD in total.
Sources of Truth

• Static SD: in the main Prometheus config
• File SD: Files on disk
• HTTP SD: HTTP endpoints
Generic Service Discovery
https://inuits.eu/blog/prometheus-http-service-discovery/

[
  {
    "targets": ["10.0.10.2:9100"],
    "labels": {
      "__meta_datacenter": "london"
    }
  }
]
Generic SD format (file & http SD)
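Producing this format from your own source of truth takes only a few lines. The Python sketch below groups a made-up inventory by datacenter and emits target groups in the format above. The inventory structure and the name `to_target_groups` are hypothetical.

```python
import json
from collections import defaultdict


def to_target_groups(inventory):
    """Group hosts by datacenter into file/HTTP SD target groups."""
    groups = defaultdict(list)
    for host in inventory:
        groups[host["datacenter"]].append(f'{host["address"]}:{host["port"]}')
    return [
        {"targets": sorted(targets), "labels": {"__meta_datacenter": datacenter}}
        for datacenter, targets in sorted(groups.items())
    ]


# Made-up inventory, standing in for a CMDB or an internal API.
inventory = [
    {"address": "10.0.10.2", "port": 9100, "datacenter": "london"},
    {"address": "10.0.10.3", "port": 9100, "datacenter": "london"},
    {"address": "10.0.20.2", "port": 9100, "datacenter": "paris"},
]
document = json.dumps(to_target_groups(inventory), indent=2)
```

The resulting JSON can be written to a file watched by file SD or returned by an HTTP SD endpoint.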

• Both integrate your own SD systems into Prometheus
• File SD is event based (inotify)
• HTTP SD can be integrated in your apps
File SD vs HTTP SD
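An HTTP SD endpoint embedded in an application can be as small as the following Python sketch, which uses only the standard library. The target list, port 8000, and all names are assumptions for illustration; Prometheus expects an HTTP 200 response with a JSON body and a `Content-Type` of `application/json`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical static list; a real app would build this from its own state.
TARGET_GROUPS = [
    {"targets": ["10.0.10.2:9100"], "labels": {"__meta_datacenter": "london"}},
]


def sd_response():
    """Return (status, headers, body) for the service discovery endpoint."""
    body = json.dumps(TARGET_GROUPS).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Content-Length": str(len(body)),
    }
    return 200, headers, body


class SDHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers, body = sd_response()
        self.send_response(status)
        for key, value in headers.items():
            self.send_header(key, value)
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Create (but do not start) the HTTP server for the endpoint."""
    return HTTPServer(("127.0.0.1", port), SDHandler)
```

Calling `make_server().serve_forever()` starts it; an `http_sd_configs` entry in the Prometheus configuration would then point at its URL.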

Labels can be used to configure targets.
• __address__: 127.0.0.1:9090
• __metrics_path__: /metrics
• __scheme__: http or https
• __param_<name>: http parameter
• __scrape_interval__, __scrape_timeout__: 1m
Labels

Additionally, extra labels are added by SD.
• __meta_kubernetes_pod_label_app
• __meta_digitalocean_region
• __meta_linode_public_ipv6
• __meta_scaleway_instance_status
Meta labels
https://prometheus.io/docs/prometheus/latest/configuration/configuration/

A fundamental principle in Prometheus.
Transform input labels into a new set of labels.
Relabeling

• Rename, merge, replace labels
• Conditionally drop label sets
• Only keep matching label sets
Relabeling actions
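The `keep` and `drop` actions can be modelled in a few lines of Python. This is a simplified sketch, not the Prometheus implementation: it uses Python's `re` module rather than Go's RE2 syntax, and the function name `apply_filter` is hypothetical. As in Prometheus, the source label values are joined with `;` and the regex must match the whole joined value.

```python
import re


def apply_filter(labels, action, source_labels, regex, separator=";"):
    """Return the label set unchanged, or None if the target is dropped."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    matched = re.fullmatch(regex, value) is not None
    if action == "keep":
        return labels if matched else None
    if action == "drop":
        return None if matched else labels
    raise ValueError(f"unsupported action: {action}")


target = {"__address__": "10.0.10.2:9100", "__meta_datacenter": "london"}
kept = apply_filter(target, "keep", ["__meta_datacenter"], "london|paris")
dropped = apply_filter(target, "drop", ["__meta_datacenter"], "london")
```

`keep` discards everything that does not match, while `drop` discards only what matches.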

• Gets lots of labels as input
• Turns them into targets
• Removes labels prefixed with __
• Can use "special labels"
Target relabeling

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://prometheus.io
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
Target relabeling example
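To follow what the three rules in the example do, here is a simplified Python model of the default `replace` action. It is a sketch under stated assumptions, not the Prometheus implementation: it uses Python regex syntax instead of RE2, supports only `replace`, and the function name `relabel_replace` is hypothetical. The defaults mirror the documented ones: regex `(.*)`, replacement `$1`, separator `;`.

```python
import re


def relabel_replace(labels, target_label, source_labels=(), regex="(.*)",
                    replacement="$1", separator=";"):
    """Apply one 'replace' relabel rule and return the new label set."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the rule changes nothing
    # Translate $1 / ${1} references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement.replace("\\", "\\\\"))
    new_value = match.expand(template)
    result = dict(labels)
    if new_value:
        result[target_label] = new_value
    else:
        result.pop(target_label, None)  # an empty value removes the label
    return result


# The target as it comes out of static_configs in the example above.
labels = {"__address__": "http://prometheus.io", "job": "blackbox"}
labels = relabel_replace(labels, "__param_target", ["__address__"])
labels = relabel_replace(labels, "instance", ["__param_target"])
labels = relabel_replace(labels, "__address__", replacement="127.0.0.1:9115")

# After relabeling, labels prefixed with __ are removed from the stored series.
stored = {k: v for k, v in labels.items() if not k.startswith("__")}
```

The probed URL ends up in the `target` query parameter and in `instance`, while the scrape itself goes to the exporter at `127.0.0.1:9115`.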

https://relabeler.promlabs.com/
Relabeler

puppetdb_sd_configs:
  - url: http://127.0.0.1:8080/
    query: 'resources { type = "Apache::Vhost" }'
    include_parameters: true
relabel_configs:
  - source_labels: [__meta_puppetdb_parameter_servername]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115
Relabel from SD

• Simple to deploy
• Behaves as you expect
• Easy to reload
Prometheus is automation friendly

• Easy password/cert rotation
• Service discovery to keep up with infrastructure changes
Prometheus is change friendly

https://prometheus.io/community
Prometheus is open, join us!

Julien Pivotto
@roidelapluie
roidelapluie@inuits.eu
Essensteenweg 31
2930 Brasschaat
Belgium
Contact:
info@inuits.eu
+32-3-8082105
