2. What is Prometheus?
Community Driven Open-source Monitoring and
Alerting stack, ships a time series database, an
alerting entity and a number of integration tools to
expose metrics.
Made for dynamic cloud environments.
3. What is Prometheus made for?
• Instrumentation for applications and
systems
• Metrics collection and storage
• Querying, alerting, dashboarding
4. What is Prometheus not made for?
• Logging or tracing
• Automatic anomaly detection
• Scalable or durable storage
5. A brief but distinguished history
Started in 2012 as a SoundCloud
internal project
Second project to join CNCF after
Kubernetes
Prometheus v1.0.0 released in 2016
Prometheus v2.0.0 released in 2017
6. Core features
• Simplicity + efficiency
• Dimensional data model
• Powerful query language - PromQL
• Service discovery integration
8. Simplicity + efficiency:
• Local storage, no clustering
• 1 million+ samples/s
• Millions of series
• 1-2 bytes per sample
• HA by running two
• Go: static binary
9. Data model
https://prometheus.io/docs/concepts/data_model/
• Prometheus stores all data as time series: streams
of timestamped values belonging to the same metric
and the same set of labeled dimensions.
• Every time series is uniquely identified by its metric
name and optional key-value pairs called labels.
• The metric name specifies the general feature of a
system that is measured
10. Time series with labels
node_cpu_seconds_total{cpu="0",instance="demo.robustperception.io:9100",job="nod
e",mode="idle"} 14838327.84
metric name labels
• Flexible
• No hierarchy
• Explicit dimensions
17. Exporters/Clientlibs
The community has contributed
exporters and clientlibs for pretty
much everything
https://prometheus.io/docs/instrumenting/exporters/
https://prometheus.io/docs/instrumenting/clientlibs/
18. Dynamic Environments: new challenges!
• Dynamic VMs
• Cluster schedulers
• Microservices
• …many services, dynamic hosts, and
ports
23. Writing alerts: a real-world example
- alert: NodeCPUStuckInIOWait
annotations:
message: '{{ $labels.instance }} spent more than half its CPU time in
IOWait in the last 5 minutes'
doc: "This alert fires if CPU time in IOWait mode calculated on a 5
minutes window for a given instance was more than 50% in the last 15
minutes."
expr: |
rate(node_cpu_seconds_total{mode="iowait"}[5m]) > 0.5
for: 15m
labels:
severity: warning