The Hitchhiker’s Guide to Prometheus
Remco Overdijk
"A Metric, The Hitchhiker's Guide to Prometheus says, is
about the most massively useful thing someone doing
Monitoring can have. It has great practical value. You can
wave your Metric in emergencies as a distress signal, and
produce pretty Graphs at the same time."
1. The Landscape
What are we running and why?
2. Core Concepts
How does Prometheus work?
3. Demo Time!
It’s a Tools in Action talk after all, right?
4. Tips & Tricks
Getting the most out of your Prometheus Experience
5. Questions?
I’m probably going to answer “42” to most of them..
So many things to tell, so little time..
The Hitchhiker’s Guide to Prometheus
• Started out in TES, doing Metrics, Monitoring & Logging.
(Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc. )
• Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud.
That requires a lot of monitoring…
• Member of the Cloud9 MML Circle, doing Prometheus
• Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources
within Cloud9
• Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring
production systems.
• NightOwl for SRT Platform; I know how pagers work.
Who are you, and why are you telling us this?
Introduction
The Landscape
What are we running?
Data Center VS Cloud
VMs and Servers VS containers in Kubernetes
                Cloud (containers in Kubernetes)       Data Center (VMs and servers)
Monitoring      Prometheus                             Nagios + Thruk + Lookingglass
Metrics         Prometheus (+ InfluxDB/Thanos)         Graphite + Statsd
Alerting        AlertManager, Iris, OnCall, Grafana    SMS modems in physical servers
Visualization   Grafana                                Grafana
Logging         StackDriver, ElasticSearch + Kibana    ElasticSearch + Kibana
• Applications in Kubernetes are much more dynamic than we’re used to.
  • No static IP addresses.
  • No static number of servers (well, pods actually).
  • Kubernetes can reschedule / relocate pods at will.
  • Prometheus uses Service Discovery to find targets.
• Both Nagios and Graphite have scaling issues and are too rigid.
  • Prometheus is pull-based instead of push-based and doesn’t require a separate execution for every single check.
  • It combines Metrics & Monitoring into a single stack, but focuses on Monitoring.
• Being inspired by Borgmon, it works out of the box with a lot of Kubernetes / Cloud native components and the services supporting them.
• StackDriver is not a full-fledged alternative due to features, retention and cost.
Why didn’t you come up with something else?
So, why Prometheus?
•Out of the box, Prometheus also doesn’t scale endlessly without compromises
(But Thanos will)
•Scalability is solved through retention, manual sharding and vertical scaling,
which all have clear drawbacks.
•HA is solved through duplication (scraping twice from independent instances, each with its own TSDB).
•Prometheus development is very focused, which shows in certain aspects.
Well.. No.
Is this the answer to everything then?
All the pods & services
Infrastructure Overview
[Diagram: Kubernetes {DEV, STG, PRO} Clusters and Datacenters, with components Prometheus, AlertManager, Grafana, PushGateway, IRIS, OnCall, SMS / Call Provider, HipChat, Operator, Remote Storage Adapter, InfluxDB, YOUR App!, Kubernetes, Exporters]
Core Concepts
How does it work and what makes it tick?
- Counters
- A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only
increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7)
- Gauges
- A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
(1, 4, 2, 5, 8)
- Histograms
- A histogram samples observations (usually things like request durations or response sizes) and counts them in
configurable buckets. It also provides a sum of all observed values.
- Summaries
- Similar to a histogram, a summary samples observations (usually things like request durations and response
sizes). While it also provides a total count of observations and a sum of all observed values, it calculates
configurable quantiles over a sliding time window.
- Quantiles are convenient when (for example) expressing the median (0.5-quantile) and the 95th percentile (0.95-quantile).
Supported Types
Making Metrics
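The behaviour of three of these types can be illustrated with a small, dependency-free Python sketch. The classes `Counter`, `Gauge` and `Histogram` below are hypothetical stand-ins written for this example; they are not the API of any Prometheus client library. The histogram uses cumulative buckets, meaning each bucket counts all observations less than or equal to its upper bound. Summaries are left out because they compute quantiles on the client side.

```python
import math


class Counter:
    """Monotonically increasing value; resets to zero only on restart."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Single value that may go up and down arbitrarily."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations in cumulative buckets and tracks their sum."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and ends up equal to the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0

    def observe(self, value):
        self.sum += value
        # Cumulative: increment every bucket whose bound covers the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    @property
    def count(self):
        return self.bucket_counts[-1]


requests = Counter()
requests.inc()
requests.inc(4)

latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.7, 2.0):
    latency.observe(seconds)
```

Because the buckets are cumulative, the last (+Inf) bucket doubles as the total observation count.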
- Instead of creating separate checks for every metric that should be monitored for your
application, you expose a single (or multiple..) HTTP Endpoint containing all metrics.
- It’s your responsibility to make this endpoint Available, Fast and Reliable.
- Multiple Frameworks and Libraries can help you provision and maintain such an endpoint.
- Axle Comes with built-in support for MicroMeter, which does everything for you.
- Backspin support is coming soon™.
- Example: http://localhost:30000/metrics
The concept of Scraping HTTP Metric Endpoints
Exposing Metrics: Push VS Pull
# HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
# TYPE prometheus_tsdb_head_min_time gauge
prometheus_tsdb_head_min_time 1.5282792e+12
# HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
# TYPE prometheus_tsdb_head_samples_appended_total counter
prometheus_tsdb_head_samples_appended_total 2.9485092e+07
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 19956
# HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
# TYPE prometheus_tsdb_head_series_created_total gauge
prometheus_tsdb_head_series_created_total 56888
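The page above is plain text, so producing it by hand is straightforward. The following is a minimal sketch, assuming a hypothetical helper `render_metric` that is not part of any library. It skips the escaping of special characters in label values that a real client library would perform.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


page = render_metric(
    "prometheus_tsdb_head_series",
    "gauge",
    "Total number of series in the head block.",
    [({}, 19956)],
)
print(page, end="")
```

In practice a client library or framework generates this page for you; the sketch only shows that there is no magic in the format.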
- An actual Query Language that looks a lot more like SQL than Graphite.
- You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for
monitoring and long term metrics.
- Allows for a lot of flexibility, but can be a bit harder to grasp when starting out.
- Supports functions, operators, regex, arithmetic and expressions.
- Four expression types are supported:
- Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"})
- Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp
(instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time
series that have this metric name.
- Range Vectors (like http_requests_total{job="prometheus"}[5m] )
- Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant.
Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time
values should be fetched for each resulting range vector element.
- Scalars
- Strings
PromQL
Querying Metrics
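How an instant vector selector picks series can be sketched in a few lines of Python. The series data and the function `instant_vector` are hypothetical and only model the four label matchers (`=`, `!=`, `=~`, `!~`). Python's `re` module stands in for the regex engine Prometheus uses, so syntax details may differ.

```python
import re

# Hypothetical in-memory series: (metric name, labels, latest sample value).
SERIES = [
    ("http_requests_total", {"environment": "staging", "method": "GET"}, 120.0),
    ("http_requests_total", {"environment": "staging", "method": "POST"}, 30.0),
    ("http_requests_total", {"environment": "production", "method": "POST"}, 900.0),
    ("http_requests_total", {"environment": "testing", "method": "DELETE"}, 4.0),
]

MATCHERS = {
    "=": lambda actual, wanted: actual == wanted,
    "!=": lambda actual, wanted: actual != wanted,
    # PromQL regex matchers are fully anchored, hence fullmatch().
    "=~": lambda actual, wanted: re.fullmatch(wanted, actual) is not None,
    "!~": lambda actual, wanted: re.fullmatch(wanted, actual) is None,
}


def instant_vector(name, matchers):
    """Return the series of `name` whose labels satisfy every matcher.

    `matchers` is a list of (label, operator, value) triples.
    A missing label is treated as the empty string.
    """
    selected = []
    for series_name, labels, value in SERIES:
        if series_name != name:
            continue
        if all(MATCHERS[op](labels.get(label, ""), wanted)
               for label, op, wanted in matchers):
            selected.append((labels, value))
    return selected


# Equivalent of:
# http_requests_total{environment=~"staging|testing|development", method!="GET"}
result = instant_vector(
    "http_requests_total",
    [("environment", "=~", "staging|testing|development"),
     ("method", "!=", "GET")],
)
```

The result holds one sample per matching series, which is exactly what makes it an instant vector rather than a range vector.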
- Custom Resource Type provided by Prometheus-operator
- Abstraction of Prometheus “job” and Service Discovery
- Allows for easy ingestion of new endpoints through their k8s service
- Example:
ServiceMonitors
Getting your endpoint monitored
[Diagram: YOUR App!, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: node-exporter
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
spec:
  ports:
  - name: https
    port: 9100
    protocol: TCP
    targetPort: https
  selector:
    app: node-exporter
  type: ClusterIP
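Two different label selectors are at work in these manifests, which is easy to mix up. The sketch below models `matchLabels` semantics (every listed key/value must be present) with a hypothetical `selects` function. The pod labels are made up for illustration and are not taken from the slides.

```python
def selects(match_labels, object_labels):
    """True if every key/value in match_labels is present in object_labels."""
    return all(object_labels.get(k) == v for k, v in match_labels.items())


# ServiceMonitor spec.selector.matchLabels: selects Services by their labels.
service_monitor_selector = {"k8s-app": "node-exporter"}
# Service metadata.labels: what the ServiceMonitor looks at.
service_metadata_labels = {"k8s-app": "node-exporter"}
# Service spec.selector: selects the Pods backing the Service.
service_pod_selector = {"app": "node-exporter"}

# Hypothetical pod labels, not taken from the slides.
pod_labels = {"app": "node-exporter", "pod-template-hash": "abc123"}

monitor_finds_service = selects(service_monitor_selector, service_metadata_labels)
service_finds_pod = selects(service_pod_selector, pod_labels)
monitor_finds_pod = selects(service_monitor_selector, pod_labels)
```

The ServiceMonitor never matches pods directly: it finds the Service through `k8s-app`, and the Service finds the pods through `app`.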
- The same tool you were probably already using.
- The central interface for cloud insights
- Contains a specialized query editor for Prometheus data sources.
- Prometheus currently doesn’t store metrics older than one month for performance reasons.
- Multiple solutions for long term metrics exist, but it’s a work in progress.
Dashboarding with Grafana
Creating Insights
[Diagram: Prometheus, Grafana, HipChat, Remote Storage Adapter, InfluxDB]
Trouble in Paradise
Creating Alerts, choosing your weapon
WARNINGS – Notifications During workhours
- No direct intervention is required
- Usually picked up by members of the team
developing / maintaining a system.
- Alert delivery is NOT guaranteed.
Use Grafana with HipChat or Email alerts
CRITICALS – 24x7 Text Messages with Escalation
- Actionable events that require immediate attention
by an Engineer on Duty, who does not necessarily
have intimate knowledge of your system.
- Response is required to silence/end the alert.
- Provisioned through RuleList (R2D2 / Operator)
Use AlertManager / Iris / Oncall
Yes, It’s PromQL as well!
Alert Basics
%YAML 1.1
---
kind: PrometheusAlertRule
data:
  test.rules: |
    groups:
    - name: Load
      interval: 30s
      rules:
      - alert: HighLoad
        expr: rate(web_http_responses_total[1m]) > 1
        for: 1m
        labels:
          severity: attention
        annotations:
          description: The rate of HTTP requests is too high.
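The interplay between `interval` and `for` in the rule above can be sketched as a small state machine. This is a simplified model with a hypothetical function `alert_states`: it assumes evaluations happen exactly every `interval` seconds and ignores everything AlertManager does after an alert fires.

```python
def alert_states(expr_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    expr_results: one boolean per evaluation (True = expression returned a result).
    interval_s:   seconds between evaluations (the group's `interval`).
    for_s:        how long the condition must hold before firing (`for`).
    """
    states = []
    active_since = None  # evaluation time at which the condition became true
    for i, is_true in enumerate(expr_results):
        now = i * interval_s
        if not is_true:
            # Any gap resets the clock; the alert has to start over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# interval: 30s and for: 1m, as in the HighLoad rule.
states = alert_states(
    [True, True, False, True, True, True, True],
    interval_s=30,
    for_s=60,
)
```

A single evaluation where the expression returns nothing resets the pending period, which is why flapping conditions may never reach the firing state.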
- Alerts should be actionable: Somebody has to do something, now.
- They should be simple: Someone without intimate knowledge of the system should ideally be
able to solve the alert.
- They should be urgent and require human intervention: No point in waking someone up if they
shouldn’t have to do something, or when tomorrow afternoon would be soon enough.
- Provide accurate descriptions and a playbook where possible.
- Basic system monitoring should be based on SLI/SLO’s rather than infra metrics.
- Prefer AM/Iris/OnCall if you’re serious about your alert.
Creating the perfect alert
Alert Perfection
[Diagram: Prometheus, AlertManager, Grafana, IRIS, OnCall, SMS / Call Provider, HipChat]
• A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/
• A number of these come preconfigured with our Kubernetes clusters and provide additional metrics
When artisanal endpoints don’t cut the cake
Exporters - Additional sources of metrics
Databases
Aerospike exporter
ClickHouse exporter
Consul exporter (official)
CouchDB exporter
ElasticSearch exporter
Memcached exporter (official)
MongoDB exporter
MSSQL server exporter
MySQL server exporter (official)
OpenTSDB Exporter
Oracle DB Exporter
PgBouncer exporter
PostgreSQL exporter
ProxySQL exporter
RavenDB exporter
Redis exporter
RethinkDB exporter
SQL exporter
Tarantool metric library
Hardware related
apcupsd exporter
Collins exporter
IoT Edison exporter
IPMI exporter
knxd exporter
Node/system metrics exporter (official)
Ubiquiti UniFi exporter
Messaging systems
Beanstalkd exporter
Gearman exporter
Kafka exporter
NATS exporter
NSQ exporter
Mirth Connect exporter
MQTT blackbox exporter
RabbitMQ exporter
RabbitMQ Management Plugin exporter
Storage
Ceph exporter
Ceph RADOSGW exporter
Gluster exporter
Hadoop HDFS FSImage exporter
Lustre exporter
ScaleIO exporter
HTTP
Apache exporter
HAProxy exporter (official)
Nginx metric library
Nginx VTS exporter
Passenger exporter
Tinyproxy exporter
Varnish exporter
WebDriver exporter
APIs
AWS ECS exporter
AWS Health exporter
AWS SQS exporter
Cloudflare exporter
DigitalOcean exporter
Docker Cloud exporter
Docker Hub exporter
GitHub exporter
InstaClustr exporter
Mozilla Observatory exporter
OpenWeatherMap exporter
Pagespeed exporter
Rancher exporter
Speedtest exporter
Logging
Fluentd exporter
Google's mtail log data extractor
Grok exporter
Other monitoring systems
Akamai Cloudmonitor exporter
AWS CloudWatch exporter (official)
Cloud Foundry Firehose exporter
Collectd exporter (official)
Google Stackdriver exporter
Graphite exporter (official)
Heka dashboard exporter
Heka exporter
InfluxDB exporter (official)
JavaMelody exporter
JMX exporter (official)
Munin exporter
Nagios / Naemon exporter
New Relic exporter
NRPE exporter
Osquery exporter
Pingdom exporter
scollector exporter
Sensu exporter
SNMP exporter (official)
StatsD exporter (official)
Miscellaneous
Bamboo exporter
BIG-IP exporter
BIND exporter
Bitbucket exporter
Blackbox exporter (official)
BOSH exporter
cAdvisor
Confluence exporter
Dovecot exporter
eBPF exporter
Jenkins exporter
JIRA exporter
Kannel exporter
Kemp LoadBalancer exporter
Meteor JS web framework exporter
Minecraft exporter module
PHP-FPM exporter
PowerDNS exporter
Process exporter
rTorrent exporter
SABnzbd exporter
Script exporter
Shield exporter
SMTP/Maildir MDA blackbox prober
SoftEther exporter
Transmission exporter
Unbound exporter
Xen exporter
• StackDriver Exporter – Get your GCP Project’s native metrics into Prometheus.
• Blackbox Exporter – Monitor Golden Signals on any system, without knowledge of its inner workings.
• Nginx exporter – Used in Ingresses.
• SNMP Exporter – Bring your own MIBs.
• Statsd Exporter – Push your statsd metrics to a sidecar container.
• Node Exporter – Provides system metrics for VMs and physical systems (like Kubernetes nodes).
• cAdvisor – Get generic container metrics.
• Etcd
• Kubernetes
• Minio (Gitlab Runner Caching)
The most commonly used
Exporters - Highlights
[Diagram: Exporter, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
• For situations where you are unable to serve an HTTP metrics page for a reliable period of time.
• Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc.
• Allows you to Push (through an HTTP call) Metrics to a buffering service, which in turn exposes them to Prometheus.
• Metrics will live forever on the Gateway, so be careful about what you push and how you name it.
• Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if and when possible.
• PRO-Tip: If you have an ephemeral job, also push the timestamp of the last successful job completion.
The Push Gateway
Metrics for ephemeral jobs
[Diagram: YOUR App!, Push Gateway, Prometheus]
echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
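The URL in the curl call encodes the job name and the extra labels as path segments, which is why they show up as labels on the resulting metric. A minimal sketch of building that URL and request body follows. The helpers `push_url` and `push_body` and the metric name `job_last_success_timestamp_seconds` are hypothetical, and label values containing a slash need special handling that this sketch does not cover. Nothing is actually sent over the network here.

```python
import time
from urllib.parse import quote


def push_url(base_url, job, grouping):
    """Build a Push Gateway URL of the form /metrics/job/<job>/<label>/<value>/..."""
    parts = ["metrics", "job", quote(job, safe="")]
    for label, value in grouping:
        parts += [quote(label, safe=""), quote(value, safe="")]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def push_body(samples):
    """Render samples as text exposition lines; the body ends with a newline."""
    return "".join(f"{name} {value}\n" for name, value in samples)


url = push_url(
    "http://gateway:9091",
    "magrathea",
    [("instance", "zaphod-001"), ("group", "vogon"), ("opex", "DPI")],
)
body = push_body([
    ("ultimate_answer", 42.0),
    # Hypothetical metric name for the "last successful completion" tip.
    ("job_last_success_timestamp_seconds", int(time.time())),
])
```

Pushing the completion timestamp alongside the result lets an alert detect a job that has silently stopped running, since the stale value stays on the Gateway.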
Demo Time!
• Kubernetes Running on Docker for macOS.
• Out of the box Prometheus on Kubernetes from https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
• Services are running without an Ingress, so we’re accessing them directly, using NodePorts.
• We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and creating an alert for it.
• Prometheus: http://localhost:30000/graph
• AlertManager: http://localhost:31000/#/alerts
• Grafana: http://localhost:32000/d/9dP_FHImz/pods
Getting started in 5 minutes
Today’s Quick Demo
Tips & Tricks
Getting the most out of your Prometheus Experience
• Metrics in Prometheus are multi dimensional; They consist of names and labels.
• Names are generic identifiers to tell WHAT you are measuring, in what format.
• Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit. (bytes, seconds,
meters)
• Labels describe characteristics, and are usually used to identify WHERE those metrics are coming from,
and can be multi faceted.
• Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure
label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames,
internet IP addresses, hashes).
• Read https://prometheus.io/docs/practices/naming/ before you start making your own!
Keep things running smoothly by not making a mess.
Metric Naming
api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
api_request_duration_seconds { stage="extract|transform|load" }
api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
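Since every name/labels combination becomes its own time series, the worst-case series count for a metric is the product of the number of values per label. A quick sketch, using a hypothetical `series_count` helper and a made-up figure of 10,000 usernames, shows how fast this grows:

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric name."""
    return prod(len(values) for values in label_values.values())


api_http_requests_total = {
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
}
bounded = series_count(api_http_requests_total)

# Adding a high-cardinality label (hypothetical: 10,000 distinct usernames).
with_usernames = series_count({
    **api_http_requests_total,
    "username": [f"user{i}" for i in range(10_000)],
})
```

Nine series per instance is harmless; ninety thousand per instance is the kind of label choice that hurts Prometheus.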
•An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of
the level of service that is provided.
•An SLO is a service level objective: a target value or range of values for a service level that is
measured by an SLI. A natural structure for SLOs is thus
[SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound].
•Symptoms vs Causes: Monitor things that users will notice when using your system.
•Latency - The time it takes to service a request.
•Traffic - A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second.
•Errors - The rate of requests that fail (like HTTP 500s).
•Saturation - How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained.
What should you be monitoring?
The Golden Signals
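The [SLI ≤ target] structure maps directly onto code. The sketch below computes an error-ratio SLI for the Errors signal and checks it against an SLO. The function names, the request counts and the 1% target are all hypothetical values chosen for illustration.

```python
def error_ratio_sli(errors_total, requests_total):
    """SLI: fraction of requests that failed over some window."""
    if requests_total == 0:
        return 0.0  # no traffic, so nothing failed
    return errors_total / requests_total


def meets_slo(sli, target):
    """SLO of the form [SLI <= target]."""
    return sli <= target


# Hypothetical counter increases over a 5 minute window.
sli = error_ratio_sli(errors_total=12, requests_total=4800)
ok = meets_slo(sli, target=0.01)

# A worse window: 100 failures out of the same 4800 requests.
breached = not meets_slo(error_ratio_sli(100, 4800), target=0.01)
```

Alerting on the SLO check rather than on raw infrastructure metrics keeps the focus on symptoms that users actually notice.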
•BlackBox Exporter for periodic requests and their Metrics (Success, Latency, Errors)
•Nginx Ingress Metrics for a man-in-the-middle view of your application (Flow, Latency, Errors)
•Your own application’s Metrics for insights, details and an under-the-hood view.
Combining Metric Sources for an unbiased view
Bringing it all together
[Diagram: Your App, Blackbox Exporter, Ingress; metric flows: Poll Metrics, Ingress Metrics, App Metrics]
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - http://myapp.behindingress.io  # Target to probe with http
Prometheus scrape
•Introducing the GenericServiceMonitor and DCServiceMonitor
•These types allow you to define endpoints outside of Kubernetes, and allow
you to monitor on-premise services.
•DCServiceMonitor works based on bol_applications and as such is bol.com
specific:
•GenericServiceMonitor works on static endpoints
My stuff runs in the DC and I want to keep it there.
So what about non-Cloud resources?
kind: Prometheus/DCServiceMonitor
name: tst-sdd-app
spec:
  port: 8080
  path: /internal/metrics

kind: Prometheus/GenericServiceMonitor
name: dev-atscale-app
spec:
  hosts:
  - ip: 1.2.3.4
    hostname: some.host.name
  port: 8080
  path: /internal/metrics
  opex: srt-bificsps
•Always initialize your metrics at zero when possible, or you won’t know the significance of the
first value.
•How do you know if your application is OK when the metrics stopped working? The up metric might also disappear when Service Discovery no longer detects your service. Always use absent() to check for the existence of up!
•(i)rate()/increase() then sum(), not sum() then (i)rate()/increase(), since rate(), irate() and increase() are the functions that detect counter resets, and summing first hides the resets from them.
•The rate function takes a time series over a time range and calculates the per-second average rate of increase, based on the first and last data points within that range (http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1 )
•By contrast, irate is an instant rate. It only looks at the last two points within the range passed to it and calculates a per-second rate.
•To complement the saturation signal; Prometheus has predict_linear() for Gauges.
•All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22}
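The idea behind predict_linear() is simple linear regression over the samples in a range. The sketch below is a simplified model, not the Prometheus implementation: the Python function `predict_linear` here predicts relative to the last sample, and the disk usage numbers are hypothetical.

```python
def predict_linear(samples, seconds_ahead):
    """Predict a gauge's value `seconds_ahead` after the last sample.

    samples: list of (timestamp_seconds, value) pairs, at least two of them.
    Fits value = slope * t + intercept by ordinary least squares.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical disk usage in percent, sampled once a minute.
disk_used = [(0, 70.0), (60, 71.0), (120, 72.0), (180, 73.0)]
in_one_hour = predict_linear(disk_used, 3600)
```

A predicted value above 100 percent within the hour is a saturation warning that arrives before the disk is actually full.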
Things you’ll encounter once you start making queries
Other tips
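Why the order of rate and sum matters becomes visible with a few numbers. The sketch below is a simplified model of reset handling with hypothetical functions `increase`, `rate` and `irate`: any drop in a counter is treated as a restart from zero, and the extrapolation to window boundaries that Prometheus performs is left out.

```python
def increase(samples):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter restarted: everything counted since the reset is new.
            total += current
    return total


def rate(samples, window_s):
    """Average per-second increase over the whole window."""
    return increase(samples) / window_s


def irate(samples, step_s):
    """Per-second increase based on the last two samples only."""
    return increase(samples[-2:]) / step_s


counter = [1, 2, 5, 9, 0, 2, 7]  # the counter example from the types slide

# Two instances of the same counter; instance b restarts before its last sample.
a = [10, 20, 30]
b = [100, 110, 5]
rate_then_sum = increase(a) + increase(b)
sum_then_rate = increase([x + y for x, y in zip(a, b)])
```

Applying the reset-aware function per series gives 35, the true combined increase. Summing first turns the restart of b into what looks like a reset of the whole sum and yields 55.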
Questions?
Don’t bother to ask me the Ultimate Question of Life, the
Universe and Everything, because you already know the answer.
(and yes, I know where my towel is.)
Remco Overdijk
roverdijk@bol.com
So Long!
And thanks for all the fish.

Weitere ähnliche Inhalte

Was ist angesagt?

Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time ComputationSonal Raj
 
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Brian Brazil
 
Gatling workshop lets test17
Gatling workshop lets test17Gatling workshop lets test17
Gatling workshop lets test17Gerald Muecke
 
Velocity 2015 building self healing systems (slide share version)
Velocity 2015 building self healing systems (slide share version)Velocity 2015 building self healing systems (slide share version)
Velocity 2015 building self healing systems (slide share version)SOASTA
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Brian Brazil
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Brian Brazil
 
Synchronization problem with threads
Synchronization problem with threadsSynchronization problem with threads
Synchronization problem with threadsSyed Zaid Irshad
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Brian Brazil
 
GR8Conf 2011: Tuning Grails Applications by Peter Ledbrook
GR8Conf 2011: Tuning Grails Applications by Peter LedbrookGR8Conf 2011: Tuning Grails Applications by Peter Ledbrook
GR8Conf 2011: Tuning Grails Applications by Peter LedbrookGR8Conf
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Dieter Plaetinck
 

Was ist angesagt? (19)

Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
 
Storm
StormStorm
Storm
 
Gatling workshop lets test17
Gatling workshop lets test17Gatling workshop lets test17
Gatling workshop lets test17
 
Velocity 2015 building self healing systems (slide share version)
Velocity 2015 building self healing systems (slide share version)Velocity 2015 building self healing systems (slide share version)
Velocity 2015 building self healing systems (slide share version)
 
Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)Provisioning and Capacity Planning (Travel Meets Big Data)
Provisioning and Capacity Planning (Travel Meets Big Data)
 
Storm
StormStorm
Storm
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)Prometheus for Monitoring Metrics (Percona Live Europe 2017)
Prometheus for Monitoring Metrics (Percona Live Europe 2017)
 
Synchronization problem with threads
Synchronization problem with threadsSynchronization problem with threads
Synchronization problem with threads
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
 
Introduction to Apache Storm
Introduction to Apache StormIntroduction to Apache Storm
Introduction to Apache Storm
 
Java performance tuning
Java performance tuningJava performance tuning
Java performance tuning
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
GR8Conf 2011: Tuning Grails Applications by Peter Ledbrook
GR8Conf 2011: Tuning Grails Applications by Peter LedbrookGR8Conf 2011: Tuning Grails Applications by Peter Ledbrook
GR8Conf 2011: Tuning Grails Applications by Peter Ledbrook
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016
 

Ähnlich wie The hitchhiker’s guide to Prometheus

Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemAccumulo Summit
 
Monitoring with prometheus at scale
Monitoring with prometheus at scaleMonitoring with prometheus at scale
Monitoring with prometheus at scaleJuraj Hantak
 
Monitoring with prometheus at scale
Monitoring with prometheus at scaleMonitoring with prometheus at scale
Monitoring with prometheus at scaleAdam Hamsik
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Distributed tracing 101
Distributed tracing 101Distributed tracing 101
Distributed tracing 101Itiel Shwartz
 
Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)Eran Levy
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of PrometheusGeorge Luong
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Monitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusMonitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusWeaveworks
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 

Ähnlich wie The hitchhiker’s guide to Prometheus (20)

Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
 
Monitoring with prometheus at scale
Monitoring with prometheus at scaleMonitoring with prometheus at scale
Monitoring with prometheus at scale
 
Monitoring with prometheus at scale
Monitoring with prometheus at scaleMonitoring with prometheus at scale
Monitoring with prometheus at scale
 
Distributed Tracing
Distributed TracingDistributed Tracing
Distributed Tracing
 
RxJava@Android
RxJava@AndroidRxJava@Android
RxJava@Android
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Mini training - Reactive Extensions (Rx)
Mini training - Reactive Extensions (Rx)Mini training - Reactive Extensions (Rx)
Mini training - Reactive Extensions (Rx)
 
Distributed tracing 101
Distributed tracing 101Distributed tracing 101
Distributed tracing 101
 
Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of Prometheus
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Monitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with PrometheusMonitoring Weave Cloud with Prometheus
Monitoring Weave Cloud with Prometheus
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 

The hitchhiker’s guide to Prometheus

  • 1. The hitchhiker’s guide to Prometheus, by Remco Overdijk. "A Metric, The Hitchhiker's Guide to Prometheus says, is about the most massively useful thing someone doing Monitoring can have. It has great practical value. You can wave your Metric in emergencies as a distress signal, and produce pretty Graphs at the same time."
  • 2. The Hitchhiker’s Guide to Prometheus: So many things to tell, so little time..
      1. The Landscape: What are we running and why?
      2. Core Concepts: How does Prometheus work?
      3. Demo Time! It’s a Tools in Action talk after all, right?
      4. Tips & Tricks: Getting the most out of your Prometheus Experience
      5. Questions? I’m probably going to answer “42” to most of them..
  • 3. Introduction: Who are you, and why are you telling us this?
      • Started out in TES, doing Metrics, Monitoring & Logging (Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc.).
      • Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud. That requires a lot of monitoring…
      • Member of the Cloud9 MML Circle, doing Prometheus.
      • Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources within Cloud9.
      • Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring production systems.
      • NightOwl for SRT Platform; I know how pagers work.
  • 5. Data Center VS Cloud: VMs and Servers VS containers in Kubernetes
      Function      | Data Center                    | Cloud
      Monitoring    | Nagios + Thruk + Lookingglass  | Prometheus
      Metrics       | Graphite + Statsd              | Prometheus (+ InfluxDB/Thanos)
      Alerting      | SMS modems in physical servers | AlertManager, Iris, OnCall, Grafana
      Visualization | Grafana                        | Grafana
      Logging       | ElasticSearch + Kibana         | StackDriver, ElasticSearch + Kibana
  • 6. So, why Prometheus? Why didn’t you come up with something else?
      • Applications in Kubernetes are much more dynamic than we’re used to.
          • No static IP addresses.
          • No static number of servers (well, pods actually..).
          • Kubernetes can reschedule / relocate pods at will.
          • Prometheus uses Service Discovery to find targets.
      • Both Nagios and Graphite have scaling issues and are too rigid.
          • Prometheus is Pull based instead of Push based and doesn’t require a separate execution for every single check.
          • It combines Metrics & Monitoring into a single stack, but focuses on Monitoring.
      • Being inspired by Google’s Borgmon, it works out of the box with a lot of Kubernetes / Cloud native components and the services supporting them.
      • StackDriver is not a full-fledged alternative due to features, retention and cost.
  • 7. Is this the answer to everything then? Well.. No.
      • Out of the box, Prometheus also doesn’t scale endlessly without compromises (but Thanos will).
      • Scalability is solved through retention, manual sharding and vertical scaling, which all have clear drawbacks.
      • HA is solved through duplication (polling twice from independent instances with individual TSDBs).
      • Prometheus development is very focused, which shows in certain aspects.
  • 8. Infrastructure Overview: All the pods & services. [Diagram components: Kubernetes {DEV, STG, PRO} Clusters, Datacenters, YOUR App!, Kubernetes Exporters, PushGateway, Operator, Prometheus (x2), AlertManager (x3), Grafana, Remote Storage Adapter, InfluxDB, IRIS, OnCall, SMS / Call Provider, HipChat]
  • 9. Core Concepts: How does it work and what makes it tick?
  • 10. Making Metrics: Supported Types
      - Counters - A counter is a cumulative metric that represents a single monotonically increasing value; it can only increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7)
      - Gauges - A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. (1, 4, 2, 5, 8)
      - Histograms - A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
      - Summaries - Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
      - Quantiles are convenient when (for example) expressing the median (the 0.5 quantile) and the 95th percentile (the 0.95 quantile).
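The counter example on this slide (1, 2, 5, 9, 0, 2, 7) contains a reset. As a rough illustration of why counters and gauges are treated differently, here is a minimal Python sketch of reset-aware increase. The function name `counter_increase` is made up for this example, and the logic is a simplification of what Prometheus does, not its actual implementation.

```python
def counter_increase(samples):
    """Total increase of a counter, treating any drop as a restart from zero."""
    if len(samples) < 2:
        return 0
    total = 0
    previous = samples[0]
    for value in samples[1:]:
        if value < previous:
            # Counter reset: the process restarted at zero, so everything
            # counted since the restart (value - 0) is new increase.
            total += value
        else:
            total += value - previous
        previous = value
    return total


# The counter samples from the slide, including the reset from 9 to 0.
slide_samples = [1, 2, 5, 9, 0, 2, 7]
print(counter_increase(slide_samples))  # 15
```

For a gauge the same drop would simply be a lower value, so a plain `last - first` (6 for the samples above) is the meaningful difference there.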
  • 11. Exposing Metrics: Push VS Pull. The concept of Scraping HTTP Metric Endpoints
      - Instead of creating separate checks for every metric that should be monitored for your application, you expose a single (or multiple..) HTTP Endpoint containing all metrics.
      - It’s your responsibility to make this endpoint Available, Fast and Reliable.
      - Multiple Frameworks and Libraries can help you provision and maintain such an endpoint.
      - Axle comes with built-in support for Micrometer, which does everything for you.
      - Backspin support is coming soon™.
      - Example: http://localhost:30000/metrics
      # HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
      # TYPE prometheus_tsdb_head_min_time gauge
      prometheus_tsdb_head_min_time 1.5282792e+12
      # HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
      # TYPE prometheus_tsdb_head_samples_appended_total counter
      prometheus_tsdb_head_samples_appended_total 2.9485092e+07
      # HELP prometheus_tsdb_head_series Total number of series in the head block.
      # TYPE prometheus_tsdb_head_series gauge
      prometheus_tsdb_head_series 19956
      # HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
      # TYPE prometheus_tsdb_head_series_created_total gauge
      prometheus_tsdb_head_series_created_total 56888
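To make the pull model concrete, the following sketch renders a `/metrics` page in the text format shown on this slide, using only the Python standard library. The metric names, the `METRICS` registry and the handler class are hypothetical; in practice a client library (Micrometer in the talk's case) would do this for you.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory registry: name -> (help text, type, value).
METRICS = {
    "demo_requests_total": ("Total number of handled requests.", "counter", 3),
    "demo_queue_length": ("Current number of queued items.", "gauge", 7),
}


def render_exposition(metrics):
    """Render metrics as HELP / TYPE / sample lines, one metric after another."""
    lines = []
    for name, (help_text, metric_type, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_exposition(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, something like `http.server.HTTPServer(("", 8000), MetricsHandler).serve_forever()` would expose the page for scraping; port 8000 is an arbitrary choice for this sketch.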
  • 12. Querying Metrics: PromQL
      - An actual Query Language that looks a lot more like SQL than Graphite’s does.
      - You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for monitoring and long term metrics.
      - Allows for a lot of flexibility, but can be a bit harder to grasp when starting out.
      - Supports functions, operators, regex, arithmetic and expressions.
      - Four expression types are supported:
          - Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"})
              - Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp (instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time series that have this metric name.
          - Range Vectors (like http_requests_total{job="prometheus"}[5m])
              - Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.
          - Scalars
          - Strings
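The instant vector selector on this slide combines a regex matcher (`=~`) with a negative equality matcher (`!=`). The sketch below imitates that selection over a handful of made-up series in plain Python. It assumes Python's `re` module is close enough to Prometheus' regex dialect for this pattern, and it anchors the regex to the full label value, as Prometheus does.

```python
import re


def matches(labels, matchers):
    """Return True when a series' labels satisfy every (name, op, pattern) matcher."""
    for name, op, pattern in matchers:
        value = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = value == pattern
        elif op == "!=":
            ok = value != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, value) is not None
        elif op == "!~":
            ok = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


def select(series, metric_name, matchers):
    """Pick the series with the given metric name whose labels match."""
    return [s for s in series if s["__name__"] == metric_name and matches(s, matchers)]


# Made-up series for illustration only.
SERIES = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "staging", "method": "GET"},
    {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "development", "method": "DELETE"},
]

selected = select(
    SERIES,
    "http_requests_total",
    [("environment", "=~", "staging|testing|development"), ("method", "!=", "GET")],
)
for s in selected:
    print(s["environment"], s["method"])
```

Only the staging POST and development DELETE series survive: the production series fails the regex and the GET series fails the `!=` matcher.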
  • 13. Getting your endpoint monitored: ServiceMonitors
      - Custom Resource Type provided by Prometheus-operator
      - Abstraction of Prometheus “job” and Service Discovery
      - Allows for easy ingestion of new endpoints through their k8s service
      - Example:
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      spec:
        endpoints:
        - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
          interval: 30s
          port: https
          scheme: https
          tlsConfig:
            insecureSkipVerify: true
        jobLabel: k8s-app
        selector:
          matchLabels:
            k8s-app: node-exporter

      apiVersion: v1
      kind: Service
      metadata:
        labels:
          k8s-app: node-exporter
        name: node-exporter
      spec:
        ports:
        - name: https
          port: 9100
          protocol: TCP
          targetPort: https
        selector:
          app: node-exporter
        type: ClusterIP
      [Diagram components: Prometheus, Prometheus Operator, YOUR App!, K8s Service, ServiceMonitor]
  • 14. Creating Insights: Dashboarding with Grafana
      - The same tool you were probably already using.
      - The central interface for cloud insights.
      - Contains a specialized query editor for Prometheus data sources.
      - Prometheus currently doesn’t store metrics older than one month for performance reasons.
      - Multiple solutions for long term metrics exist, but it’s a work in progress.
      [Diagram components: Prometheus (x2), Grafana, HipChat, Remote Storage Adapter, InfluxDB]
  • 15. Trouble in Paradise: Creating Alerts, choosing your weapon
      WARNINGS: Notifications during work hours
      - No direct intervention is required.
      - Usually picked up by members of the team developing / maintaining a system.
      - Alert delivery is NOT guaranteed.
      - Use Grafana with HipChat or Email alerts.
      CRITICALS: 24x7 Text Messages with Escalation
      - Actionable events that require immediate attention by an Engineer on Duty, who does not necessarily have intimate knowledge of your system.
      - Response is required to silence/end the alert.
      - Provisioned through RuleList (R2D2 / Operator).
      - Use AlertManager / Iris / OnCall.
  • 16. Alert Basics: Yes, it’s PromQL as well!
      %YAML 1.1
      ---
      kind: PrometheusAlertRule
      data:
        test.rules: |
          groups:
          - name: Load
            interval: 30s
            rules:
            - alert: HighLoad
              expr: rate(web_http_responses_total[1m]) > 1
              for: 1m
              labels:
                severity: attention
              annotations:
                description: The rate of HTTP requests is too high.
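The rule on this slide is evaluated every 30s and uses `for: 1m`, which means the expression has to stay true for a full minute before the alert fires. The Python sketch below walks through that pending-to-firing transition. The function name and the list-of-booleans input are inventions for this example; it mirrors the general behaviour rather than Prometheus' rule engine itself.

```python
def alert_states(expr_results, interval_s, for_s):
    """Map successive rule evaluations (True/False) to alert states.

    An alert is 'pending' from the first evaluation where the expression is
    true, and becomes 'firing' once it has stayed true for at least for_s
    seconds. Any false evaluation resets it to 'inactive'.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(expr_results):
        now = i * interval_s
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_s else "pending")
    return states


# Evaluated every 30s with 'for: 1m', as in the rule on the slide.
print(alert_states([False, True, True, True, False, True], interval_s=30, for_s=60))
# ['inactive', 'pending', 'pending', 'firing', 'inactive', 'pending']
```

A short spike that is gone by the next evaluation never gets past `pending`, which is the point of the `for` clause.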
  • 17. Alert Perfection: Creating the perfect alert
      - Alerts should be actionable: Somebody has to do something, now.
      - They should be simple: Someone without intimate knowledge of the system should ideally be able to solve the alert.
      - They should be urgent and require human intervention: No point in waking someone up if they don’t have to do anything, or when tomorrow afternoon would be soon enough.
      - Provide accurate descriptions and a playbook where possible.
      - Basic system monitoring should be based on SLIs/SLOs rather than infra metrics.
      - Prefer AM/Iris/OnCall if you’re serious about your alert.
      [Diagram components: Prometheus, AlertManager (x3), Grafana, IRIS, OnCall, SMS / Call Provider, HipChat]
  • 18. Exporters - Additional sources of metrics: When artisanal endpoints don’t cut the cake
      • A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/
      • A number of these come preconfigured with our Kubernetes clusters and provide additional metrics.
      Databases: Aerospike exporter, ClickHouse exporter, Consul exporter (official), CouchDB exporter, ElasticSearch exporter, Memcached exporter (official), MongoDB exporter, MSSQL server exporter, MySQL server exporter (official), OpenTSDB Exporter, Oracle DB Exporter, PgBouncer exporter, PostgreSQL exporter, ProxySQL exporter, RavenDB exporter, Redis exporter, RethinkDB exporter, SQL exporter, Tarantool metric library
      Hardware related: apcupsd exporter, Collins exporter, IoT Edison exporter, IPMI exporter, knxd exporter, Node/system metrics exporter (official), Ubiquiti UniFi exporter
      Messaging systems: Beanstalkd exporter, Gearman exporter, Kafka exporter, NATS exporter, NSQ exporter, Mirth Connect exporter, MQTT blackbox exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter
      Storage: Ceph exporter, Ceph RADOSGW exporter, Gluster exporter, Hadoop HDFS FSImage exporter, Lustre exporter, ScaleIO exporter
      HTTP: Apache exporter, HAProxy exporter (official), Nginx metric library, Nginx VTS exporter, Passenger exporter, Tinyproxy exporter, Varnish exporter, WebDriver exporter
      APIs: AWS ECS exporter, AWS Health exporter, AWS SQS exporter, Cloudflare exporter, DigitalOcean exporter, Docker Cloud exporter, Docker Hub exporter, GitHub exporter, InstaClustr exporter, Mozilla Observatory exporter, OpenWeatherMap exporter, Pagespeed exporter, Rancher exporter, Speedtest exporter
      Logging: Fluentd exporter, Google's mtail log data extractor, Grok exporter
      Other monitoring systems: Akamai Cloudmonitor exporter, AWS CloudWatch exporter (official), Cloud Foundry Firehose exporter, Collectd exporter (official), Google Stackdriver exporter, Graphite exporter (official), Heka dashboard exporter, Heka exporter, InfluxDB exporter (official), JavaMelody exporter, JMX exporter (official), Munin exporter, Nagios / Naemon exporter, New Relic exporter, NRPE exporter, Osquery exporter, Pingdom exporter, scollector exporter, Sensu exporter, SNMP exporter (official), StatsD exporter (official)
      Miscellaneous: Bamboo exporter, BIG-IP exporter, BIND exporter, Bitbucket exporter, Blackbox exporter (official), BOSH exporter, cAdvisor, Confluence exporter, Dovecot exporter, eBPF exporter, Jenkins exporter, JIRA exporter, Kannel exporter, Kemp LoadBalancer exporter, Meteor JS web framework exporter, Minecraft exporter module, PHP-FPM exporter, PowerDNS exporter, Process exporter, rTorrent exporter, SABnzbd exporter, Script exporter, Shield exporter, SMTP/Maildir MDA blackbox prober, SoftEther exporter, Transmission exporter, Unbound exporter, Xen exporter
  • 19. Exporters - Highlights: The most commonly used
      • StackDriver Exporter - Get your GCP Project’s native metrics into Prometheus.
      • Blackbox Exporter - Monitor Golden Signals on any system, without knowledge about its inner workings.
      • Nginx exporter - Used in Ingresses.
      • SNMP Exporter - Bring your own MIBs.
      • Statsd Exporter - Push your statsd metrics to a sidecar container.
      • Node Exporter - Provides system metrics for VMs and physical systems (like Kubernetes nodes).
      • cAdvisor - Get generic container metrics.
      • Etcd
      • Kubernetes
      • Minio (Gitlab Runner Caching)
      [Diagram components: Prometheus, Prometheus Operator, Exporter, K8s Service, ServiceMonitor]
  • 20. Metrics for ephemeral jobs: The Push Gateway
      • For situations where you are unable to serve an HTTP metrics page for a reliable period of time.
      • Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc.
      • Allows you to Push (through an HTTP call) Metrics to a buffering service, which in turn exposes them to Prometheus.
      • Metrics will live forever on the Gateway, so be careful about what you push and how you name it.
      • Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if and when possible.
      • PRO-Tip: If you have an ephemeral job, also push the timestamp of the last successful job completion.
      echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
      ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
      [Diagram components: Prometheus (x2), YOUR App!, Push Gateway]
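The curl call on this slide encodes the grouping labels in the URL path. As a hedged illustration, the Python sketch below builds the same URL and adds the "last success" timestamp from the PRO-Tip. The helper names and the metric name `job_last_success_timestamp_seconds` are assumptions made for this example, and label values containing a slash are not handled.

```python
import time
import urllib.request
from urllib.parse import quote


def push_url(gateway, job, grouping):
    """Build the push URL: /metrics/job/<job>/<label>/<value>/..."""
    parts = ["metrics", "job", quote(job, safe="")]
    for name, value in grouping.items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return gateway.rstrip("/") + "/" + "/".join(parts)


def job_metrics(now=None):
    """Metrics body for a finished job, including a last-success timestamp."""
    now = time.time() if now is None else now
    return (
        "ultimate_answer 42.0\n"
        # Hypothetical metric name following the PRO-Tip on the slide.
        f"job_last_success_timestamp_seconds {now}\n"
    )


def push(gateway, job, grouping, body):
    """POST the metrics body to the gateway and return the HTTP status code."""
    request = urllib.request.Request(
        push_url(gateway, job, grouping),
        data=body.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


url = push_url(
    "http://gateway:9091",
    "magrathea",
    {"instance": "zaphod-001", "group": "vogon", "opex": "DPI"},
)
print(url)
```

With the timestamp pushed alongside the result, an alert can compare it against `time()` to notice a job that has silently stopped running, even though its old metrics still sit on the gateway.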
  • 22. Today’s Quick Demo: Getting started in 5 minutes
      • Kubernetes running on Docker for macOS.
      • Out of the box Prometheus on Kubernetes from https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
      • Services are running without an Ingress, so we’re accessing them directly, using NodePorts.
      • We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and creating an alert for it.
      • Prometheus: http://localhost:30000/graph
      • AlertManager: http://localhost:31000/#/alerts
      • Grafana: http://localhost:32000/d/9dP_FHImz/pods
  • 23. Tips & Tricks: Getting the most out of your Prometheus Experience
  • 24. Metric Naming: Keep things running smoothly by not making a mess.
      • Metrics in Prometheus are multi-dimensional; they consist of names and labels.
      • Names are generic identifiers that tell WHAT you are measuring, and in what format.
      • Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit (bytes, seconds, meters).
      • Labels describe characteristics, are usually used to identify WHERE those metrics are coming from, and can be multi-faceted.
      • Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames, internet IP addresses, hashes.)
      • Read https://prometheus.io/docs/practices/naming/ before you start making your own!
      api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
      api_request_duration_seconds { stage="extract|transform|load" }
      api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
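Because every name/labels combination becomes its own time series, the worst-case series count for a metric is the product of the number of values each label can take. The small Python sketch below works that out for the `api_http_requests_total` example from this slide; the `username` label with 100,000 values is a made-up bad example, not something from the talk.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(len(values) for values in label_values.values())


# Labels from the api_http_requests_total example on the slide.
good = {
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
}
print(max_series(good))  # 9

# Hypothetical bad idea: adding a label with one value per user.
bad = dict(good, username=[f"user{i}" for i in range(100_000)])
print(max_series(bad))  # 900000
```

One unbounded label multiplies everything else, which is why usernames, IP addresses and hashes are listed as bad examples.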
  • 25. The Golden Signals: What should you be monitoring?
      • An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service that is provided.
      • An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus [SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound].
      • Symptoms vs Causes: Monitor things that users will notice when using your system.
      • Latency - The time it takes to service a request.
      • Traffic - A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second.
      • Errors - The rate of requests that fail (like HTTP 500s).
      • Saturation - How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained.
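The [lower bound ≤ SLI ≤ upper bound] structure on this slide can be written down directly. The sketch below uses a request success ratio as the SLI, which is one common choice and an assumption here, as are the function names and the example numbers.

```python
def availability_sli(total_requests, failed_requests):
    """SLI: fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests


def meets_slo(sli, lower_bound, upper_bound=1.0):
    """SLO check in the form: lower bound <= SLI <= upper bound."""
    return lower_bound <= sli <= upper_bound


# Example numbers, chosen for illustration: 5 failures out of 1000 requests.
sli = availability_sli(total_requests=1000, failed_requests=5)
print(sli)                     # 0.995
print(meets_slo(sli, 0.99))    # True: a 99% objective is met
print(meets_slo(sli, 0.999))   # False: a 99.9% objective is missed
```

The same ratio is what an errors-signal alert would typically be built on, with the counts coming from counters instead of fixed numbers.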
  • 26. Bringing it all together: Combining Metric Sources for an unbiased view • Blackbox Exporter for periodic requests and their metrics (success, latency, errors). • Nginx Ingress metrics for a man-in-the-middle view of your application (flow, latency, errors). • Your own application's metrics for insights, details and an under-the-hood view. • Diagram: Prometheus scrapes poll metrics from the Blackbox Exporter, ingress metrics from the Ingress, and app metrics from your app. • Example scrape config for the Blackbox Exporter:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]  # Look for a HTTP 200 response.
        static_configs:
          - targets:
              - http://myapp.behindingress.io  # Target to probe with http
  • 27. So what about non-Cloud resources? My stuff runs in the DC and I want to keep it there. • Introducing the GenericServiceMonitor and DCServiceMonitor. • These types allow you to define endpoints outside of Kubernetes, so you can monitor on-premise services. • DCServiceMonitor works based on bol_applications and as such is bol.com specific. Example: kind: Prometheus/DCServiceMonitor name: tst-sdd-app spec: port: 8080 path: /internal/metrics • GenericServiceMonitor works on static endpoints. Example: kind: Prometheus/GenericServiceMonitor name: dev-atscale-app spec: hosts: - ip: 1.2.3.4 hostname: some.host.name port: 8080 path: /internal/metrics opex: srt-bificsps
  • 28. Other tips: Things you'll encounter once you start making queries • Always initialize your metrics at zero when possible, or you won't know the significance of the first value. • How do you know if your application is OK when the metrics have stopped working? The up metric may also disappear when Service Discovery no longer detects your service. Always use absent() to check for the existence of up! • Apply (i)rate()/increase() first and sum() afterwards, not sum() then (i)rate()/increase(): these are the only functions that deal safely with counter resets, and a reset is only visible on the individual series. • The rate function takes a time series over a time range and, based on the first and last data points within that range, calculates a per-second average rate of increase (http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1). • By contrast, irate is an instant rate: it only looks at the last two points within the range passed to it and calculates a per-second rate. • To complement the saturation signal, Prometheus has predict_linear() for gauges. • All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22}
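The ordering rule for sum() and increase() is easier to see with numbers. The sketch below is a simplified stand-in for Prometheus's increase(): it handles counter resets by treating any drop in value as a restart from zero, but it skips the extrapolation to the window boundaries that the real function performs. The two pods and their sample values are hypothetical.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Simplified increase(): total growth of a counter over its samples.

    A sample lower than its predecessor is treated as a counter reset,
    so the counter is assumed to have restarted from zero.
    """
    total = 0.0
    previous = samples[0]
    for current in samples[1:]:
        total += current if current < previous else current - previous
        previous = current
    return total


# Hypothetical request counters for two pods; pod_b restarts mid-window.
pod_a = [100, 110, 120, 130]
pod_b = [200, 210, 5, 15]

# Right order: increase per series, then sum.
increase_then_sum = increase(pod_a) + increase(pod_b)

# Wrong order: sum the raw counters, then take the increase.
summed = [a + b for a, b in zip(pod_a, pod_b)]
sum_then_increase = increase(summed)

print(increase_then_sum)   # 55.0
print(sum_then_increase)   # 165.0
```

In the summed series the restart of one pod looks like a reset of the whole total, so pod_a's entire count is added again as if it were new traffic. Taking the increase per series first keeps the reset local to the pod that actually restarted.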
  • 29. Questions? Don’t bother to ask me the Ultimate Question of Life, the Universe and Everything, because you already know the answer. (and yes, I know where my towel is.)