Outdated training deck for Prometheus monitoring tool - shared as a basis for newer content for potential MeetUp and Conference talks. I'm sharing it since there is some intrinsic value remaining.
1. Prometheus Monitoring – Docker Enterprise Edition
Tim Tyler – Docker Captain
January 03, 2017
This is a training deck I originally developed in December 2016 and presented as part of a
company training plan for the Docker Enterprise Edition platform.
As of this edit it is 2019 and the deck is substantially dated: the whole tool stack has moved
forward significantly, as have some of the nuts and bolts originally described here (for
instance, it’s now very easy to use Interlock instead of HAProxy as described). I’ve decided to
share it anyway, as it does have some intrinsic value remaining and can form the basis for an
updated and modernized version and potential MeetUp talk.
I’ve removed about 10% of the original content that was company specific or proprietary,
leaving only publicly available detail, and obscured some data. Many of the images worked
better on a white background, and rather than fiddle too much with them I’ve just applied
some quick picture styles.
It would be very easy to base an updated tech stack on this document and install a portable
training system on a Raspberry Pi. I am currently building a Prometheus and Grafana based
system for monitoring, alerting, and visualizing my Samsung SmartThings home automation
on a spare iMac.
@timotyler
ttyler
4. Monitoring Stack Overview
I'm still passionately interested in what my fellow humans are up to. For me, a day
spent monitoring the passing parade is a day well-spent. - Garry Trudeau
5. Monitoring containerized and microservices environments presents new challenges.
Containers can be highly ephemeral
Microservices are able to scale up and down to meet design and performance criteria
Microservices may exist for seconds, or persist indefinitely
Microservices are generally a single process
Containers live on hosts, but hosts are just pooled resources
Generally we don’t think about what host an application microservice is running on
Instances of a microservice may live on multiple hosts in a Docker Swarm
Instances of a microservice may move to different hosts within a Docker Swarm
The Swarm is a pool, and the microservices just swim in it
Monitoring, like the microservice architecture, needs to be elastic
What’s the Problem?
6. We have options, and several are readily available
Prometheus
Time series dimensional data model with Docker aware agents
Dynatrace
Specialist in application performance monitoring with Docker support
SignalFX
Newer offering with native Docker support
Sysdig
Swiss army knife for infrastructure and microservices monitoring
Can We Solve This?
7. Prometheus is a Pitts S-2A RC muscle biplane
Prometheus is a prequel and fifth installment in the Alien franchise
Prometheus is a Greek Titan that gave us fire and suffered an unfortunate fate
involving a hungry eagle and his liver
Prometheus is a leading Open Source monitoring solution
Prometheus is straightforward to implement as a primary cluster monitoring
stack
A complete stack can also include the Open Source data visualization tool
Grafana
PROMETHEUS!
What’s Prometheus?
8. Dimensional Data Metric Collector
Interactive Query Engine
Calculator for discrete multidimensional data streams
Great Visualization
Efficient Storage
Simple Operation
Alerting
Many Client Libraries
Many Integrations
PROMETHEUS!
But what does it do?
9. Prometheus Server
Scrapes and stores time series data
Alertmanager
Handles alerts generated by Prometheus Server, deduplicating, grouping, and routing alerts to configured receivers
AM-Exporter
Receiver to transmit alerts from Alertmanager to custom intake process
Exporters
Agents with specific duties that collect metrics and present them to Prometheus Server
cAdvisor, node-exporter, blackbox-exporter
Grafana
Data visualization
HAProxy
Routes calls to Prometheus Server, Alertmanager, and Grafana within the Docker overlay network
A typical Prometheus Monitoring Stack
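To make "deduplicating, grouping, and routing" a bit more concrete, here is a rough Python sketch of the idea. This is not Alertmanager's code, just a toy model; the alert dicts, the group_by label, and the receiver names ("chatbot", "email") are made up for illustration.

```python
from collections import defaultdict


def dedupe(alerts):
    """Keep one alert per identical label set (the labels identify an alert)."""
    seen = {}
    for alert in alerts:
        key = tuple(sorted(alert["labels"].items()))
        seen.setdefault(key, alert)
    return list(seen.values())


def group(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def route(alert, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(alert["labels"].get(k) == v for k, v in matchers.items()):
            return receiver
    return default_receiver


alerts = [
    {"labels": {"alertname": "NodeDown", "instance": "node1", "severity": "critical"}},
    {"labels": {"alertname": "NodeDown", "instance": "node1", "severity": "critical"}},  # duplicate
    {"labels": {"alertname": "NodeDown", "instance": "node2", "severity": "critical"}},
    {"labels": {"alertname": "DiskFull", "instance": "node1", "severity": "warning"}},
]
# Hypothetical routing table: critical alerts go to chat, everything else to email.
routes = [({"severity": "critical"}, "chatbot")]

unique = dedupe(alerts)
grouped = group(unique, group_by=["alertname"])
receivers = {a["labels"]["alertname"]: route(a, routes, "email") for a in unique}
print(receivers)
```

Grouping is what keeps a cluster-wide outage from turning into one chat message per node: both NodeDown alerts land in the same group and can be sent as a single notification.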
12. Infrastructure as Code
IaC means treating the configuration of systems the same way that software code is treated
We’re all devs now
Automate and modularize
Apply test pyramid
Version control changes, patches, and releases
Share work! (Because DevOps)
Installed via Docker orchestration and some basic automation
Makefile driven
Apply environment specific customizations (hostnames, passwords, alerts, etc.) to config files
Deploy configs across cluster
Stack Installation
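The "apply environment specific customizations to config files" step can be sketched in a few lines of Python. The deck's real templates and Makefile targets are not shown here, so the template text, the render_config helper, and the host names below are assumptions for illustration only. The YAML keys (external_labels, scrape_configs, static_configs) are standard Prometheus config keys, and 9100 is node-exporter's default port.

```python
from string import Template

# Hypothetical template; stands in for a per-cluster prometheus.yml.
CONFIG_TEMPLATE = Template(
    "global:\n"
    "  external_labels:\n"
    "    cluster: $cluster\n"
    "scrape_configs:\n"
    "  - job_name: node\n"
    "    static_configs:\n"
    "      - targets: [$targets]\n"
)


def render_config(cluster, hosts, port=9100):
    """Fill in the cluster name and node-exporter targets for one environment."""
    targets = ", ".join(f"'{host}:{port}'" for host in hosts)
    # substitute() raises KeyError if a placeholder has no value,
    # so a half-filled config never gets deployed silently.
    return CONFIG_TEMPLATE.substitute(cluster=cluster, targets=targets)


rendered = render_config("east", ["node1", "node2"])
print(rendered)
```

The rendered file is what would then be pushed to the Swarm nodes by the deploy step.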
14. Why should the thirst for knowledge be aroused, only to be disappointed and punished? Yet, like a second Prometheus, I will endure this and worse
- Edwin Abbott in Flatland: A Romance of Many Dimensions (1884)
15. Open Source systems monitoring and alerting tool originally
built at SoundCloud
Very active developer and user community
Docs and stuff
https://prometheus.io/docs/introduction/overview/
Prometheus Server
16. Collect and store time series data
Scrape defined targets for functionally specific data
Discover targets statically or dynamically
Evaluate rulesets
Allow vector arithmetic
Send alerts
What can Prometheus Server Do?
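"Vector arithmetic" means a binary operator between two sets of series is applied pairwise to the series whose labels match exactly; series without a partner drop out of the result. Here is a toy Python model of that one-to-one matching. The metric values and instance names are invented, and division by zero and the on/ignoring modifiers are deliberately left out.

```python
import operator


def labelset(**labels):
    """Hashable label set, used as the key that identifies a series."""
    return frozenset(labels.items())


def binary_op(op, left, right):
    """Apply op to samples whose label sets are identical; drop unmatched series."""
    return {
        labels: op(value, right[labels])
        for labels, value in left.items()
        if labels in right
    }


# Made-up samples standing in for two memory metrics (in GiB).
mem_total = {labelset(instance="node1"): 8.0, labelset(instance="node2"): 16.0}
mem_free = {labelset(instance="node1"): 2.0, labelset(instance="node3"): 4.0}

# Roughly what a "free / total" query would produce.
free_ratio = binary_op(operator.truediv, mem_free, mem_total)
print(free_ratio)
```

Only node1 appears in the result: node2 has no free-memory sample and node3 has no total, so neither has a match.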
19. Prometheus Exporters are basically agents that are responsible for
collecting application-specific time series metrics and presenting
them via an API endpoint for Prometheus to collect.
What Are Exporters?
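A minimal sketch of what an exporter boils down to, using only the Python standard library: collect some numbers and serve them at /metrics in the Prometheus text exposition format. The orders_pending metric, its labels, and port 8000 are hypothetical, and label values are not escaped here. For anything real, the official client libraries are the better choice.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical business metric; a real exporter would query the application here."""
    return render_metrics(
        "orders_pending",
        "Orders waiting to be processed.",
        "gauge",
        [({"queue": "priority"}, 3), ({"queue": "standard"}, 12)],
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; Prometheus would scrape http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; collect() on its own returns the text that Prometheus would scrape.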
20. Prometheus has support, either directly or via third parties, for
dozens of exporters. Some tools, such as etcd, cAdvisor,
Kubernetes, and Docker, have been directly instrumented
to provide a Prometheus endpoint.
Custom, business-specific exporters can be easily written in any
language; Go seems popular.
A Bunch of Exporters
21. A basic Docker monitoring stack implements 3 exporters
cAdvisor
Provides metrics on the Docker and container environment
node-exporter
exporter for hardware and OS metrics exposed by the kernel
blackbox-exporter
allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP and ICMP.
What is a Minimal Set of Exporters?
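Blackbox probing just means checking an endpoint from the outside. A minimal sketch of the TCP case in Python, assuming a simple "can I connect within the timeout" check. The tcp_probe helper is hypothetical and is not the blackbox-exporter's implementation; the 1/0 return value mirrors how the exporter reports probe success.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return 1 if a TCP connection succeeds within the timeout, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return 0
```

For example, tcp_probe("localhost", 9090) would tell you whether anything is listening on the default Prometheus port of the local machine.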
22. Pushgateway
allows ephemeral and batch jobs to expose their metrics
HAProxy-exporter
periodically scrapes HAProxy stats and exports them via HTTP/JSON for Prometheus
JMX-exporter
configurably scrapes and exposes MBeans of a JMX target
Mongodb-exporter
Rabbitmq-exporter
What are Some Other Exporters?
27. Prometheus provides a functional language that lets the user select and
aggregate time series data in real time. Results can be rendered as
follows:
Displayed in a graph
Viewed as tabular data
Consumed by external systems
Grafana for instance
The Basics
28. Instant vector
A set of time series data containing a single sample for each series
Range vector
A set of time series data containing a range of data points over time
Scalar
A simple numeric floating point value
Data Types
Prometheus has 3 basic data types
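One way to picture the three types is as plain Python data. This is a toy model with invented samples; staleness handling and the exact range boundary rules of the real query engine are simplified away.

```python
# Each series maps to (unix timestamp, value) samples, oldest first. Values are made up.
series = {
    'up{instance="node1"}': [(0.0, 1.0), (15.0, 1.0), (30.0, 0.0)],
    'up{instance="node2"}': [(0.0, 1.0), (15.0, 1.0), (30.0, 1.0)],
}


def instant_vector(series, at):
    """One sample per series: the most recent one at or before `at`."""
    out = {}
    for key, samples in series.items():
        past = [s for s in samples if s[0] <= at]
        if past:
            out[key] = past[-1]
    return out


def range_vector(series, at, window):
    """A list of samples per series: everything within `window` seconds before `at`."""
    out = {}
    for key, samples in series.items():
        selected = [s for s in samples if at - window <= s[0] <= at]
        if selected:
            out[key] = selected
    return out


now = instant_vector(series, at=30.0)                # like the query: up
recent = range_vector(series, at=30.0, window=20.0)  # like the query: up[20s]
scalar = 2.0                                         # a scalar is just a float
print(now)
print(recent)
```

Functions such as irate take a range vector and hand back an instant vector, which is why queries tend to nest the way they do.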
29. Operators
Prometheus supports basic logical and arithmetic operators
Arithmetic Operators    Comparison Operators     Aggregation Operators
+ (addition)            == (equal)               sum
- (subtraction)         != (not equal)           min
* (multiplication)      > (greater than)         max
/ (division)            < (less than)            avg
% (modulo)              >= (greater or equal)    count
^ (exponentiation)      <= (less or equal)       topk
30. sum
count
irate
sort
topk
time
Functions
Prometheus supports about 40 built-in functions
32. sort_desc(
  topk(5,
    sum by (image) (
      irate(container_cpu_usage_seconds_total{id=~"/docker/.*"}[5m])
    )
  )
)
Fancy Query
Top 5 Docker Images by CPU
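Reading the query from the inside out: filter the CPU counters to Docker containers, take the instantaneous rate from the last two samples of each series, sum the rates per image, keep the top 5, and sort descending. A rough Python model of those steps, with invented sample data. It is a sketch of the semantics, not of the query engine.

```python
import re
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two samples of a counter's range."""
    if len(samples) < 2:
        return None
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    delta = v2 - v1 if v2 >= v1 else v2  # on a counter reset, count up from zero
    return delta / (t2 - t1)


def top_images_by_cpu(range_vector, k=5):
    per_image = defaultdict(float)
    for labels, samples in range_vector:
        # id=~"/docker/.*" : Prometheus regex matchers are fully anchored
        if not re.fullmatch(r"/docker/.*", labels["id"]):
            continue
        rate = irate(samples)
        if rate is not None:
            per_image[labels["image"]] += rate  # sum by (image)
    # Sorting first and slicing gives the same result as topk followed by sort_desc.
    ranked = sorted(per_image.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Made-up range vector: (labels, [(timestamp, cpu_seconds_total), ...])
cpu = [
    ({"id": "/docker/a", "image": "redis"}, [(0.0, 10.0), (4.0, 11.0)]),
    ({"id": "/docker/b", "image": "redis"}, [(0.0, 5.0), (4.0, 6.0)]),
    ({"id": "/docker/c", "image": "nginx"}, [(0.0, 1.0), (4.0, 5.0)]),
    ({"id": "/system.slice/sshd.service", "image": ""}, [(0.0, 1.0), (4.0, 9.0)]),
]
ranking = top_images_by_cpu(cpu)
print(ranking)
```

The sshd series has the highest rate but never shows up, because the id matcher filters it out before any aggregation happens.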
34. Big things have small beginnings – David, from the movie Prometheus (2012)
Let’s build an Alert!
35. Alerts are just queries with comparison operators
Alerts are written in a simple format in a plain text file
Alerts can be decorated with interesting metadata
Alert metadata can be templated
Alerts can be sent to an external service
First Things First
43. ALERT NodeDown
  IF up{job="node"} == 0
  FOR 1m
  LABELS {prdcode="0000", host="Shared_Infra", severity="critical", support="Prometheus_Critical"}
  ANNOTATIONS {
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minute.",
    rosguide="Please see Application ROS guide",
    summary="Instance {{$labels.instance}} down"
  }
The Anatomy of an Alert
And go back and turn our earlier query into an alert
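The IF and FOR clauses together behave like a small state machine: the alert is pending as soon as the condition is true, and only fires once it has stayed true for the whole FOR duration. Here is a minimal Python sketch of that behaviour plus the label templating in the annotations. The sample data is invented, and real templating is far richer than this string replace.

```python
def alert_state(samples, for_seconds):
    """samples: (timestamp, value) pairs of `up`, oldest first, one per rule evaluation.
    Returns 'inactive', 'pending' or 'firing' as of the last sample."""
    active_since = None
    for ts, value in samples:
        if value == 0:                  # IF up == 0
            if active_since is None:
                active_since = ts
        else:
            active_since = None         # condition cleared, the FOR timer resets
    if active_since is None:
        return "inactive"
    last_ts = samples[-1][0]
    return "firing" if last_ts - active_since >= for_seconds else "pending"


def render_annotation(template, labels):
    """Very small stand-in for {{$labels.<name>}} templating."""
    for name, value in labels.items():
        template = template.replace("{{$labels." + name + "}}", value)
    return template


down_briefly = [(0, 1), (15, 0), (30, 0), (45, 0)]
down_for_a_minute = down_briefly + [(60, 0), (75, 0)]
recovered = down_for_a_minute + [(90, 1)]

print(alert_state(down_briefly, for_seconds=60))
print(alert_state(down_for_a_minute, for_seconds=60))
print(render_annotation("Instance {{$labels.instance}} down",
                        {"instance": "node1:9100", "job": "node"}))
```

The FOR clause is what keeps a single failed scrape from paging anyone.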
44. OptimusPrime (bot) 3:32 PM
AlertManager message: [FIRING:1] NodeDown (0000 prod node.metrics Shared_Infra node app critical Prometheus_Critical). Learn more at
https://somewhere.dockeralerts.company.com:8443/#/alerts?receiver=ChatBot
A NodeDown Alert Sent To Chat
Fate rarely calls on us at a moment of our choosing – Optimus Prime
45. Rules/Alerts are segregated into functionally specific rule files
alert.rules
basic alert installed with 1 rule ‘IF up{job="node"} == 0’
alert.infra.logging.rules
Logging ruleset
alert.infra.monitoring.rules
Monitoring stack rules
alert.infra.rules
Basic infrastructure rules such as file systems, memory, and thinpool
alert.service.app.prod.rules
Service level rules such as redis, mongodb, rabbitmq, etc.
alert.docker.rules
Rules for Docker itself
alert.0000.app.rules
Application specific rules
How are Rules/Alerts Categorized?
47. Grafana is a leading Open Source Data Visualization Tool
Create and share intuitive dashboards
Rich graphing and charting
Mixed styling within a dashboard
Dashboard templates
Lots of additional features
What is Grafana?
50. The Infrastructure Monitoring Stack is currently considered v1.0
Prometheus v1.3.1
Grafana v3.1.1
Alertmanager v0.4.2 custom-v2
HAProxy v1.6.9
cAdvisor v0.24.1
Node-exporter v0.12.0
Blackbox-exporter v0.2.0
What’s in Your Stack?
51. We use Git to manage configurations and changes to the tech stack. Git is a distributed
version control system.
Simple to use
Enables code collaboration
Eases deployments
https://somewhere.company.com/git/projects/PRJ0000/repos/infra-prom-stack/browse
Tech Stack SOA
52. The Monitoring Stack is deployed and configured from one location in each Docker Swarm, typically the first Docker Master Node.
Configuration files
/company/compose/infra-prom-stack
Prometheus Configuration
/company/compose/infra-prom-stack/infra/prometheus/config/prometheus.yml
Alertmanager Configuration
/company/compose/infra-prom-stack/infra/prometheus/config/alertmanager.conf
Alert Files
/company/compose/infra-prom-stack/infra/prometheus/alerts
Basic Stack Deployment
53. The Makefile simplifies stack management by reducing error-prone commands to simple make targets. It is used to both
configure and install the Monitoring Stack, and to manage the stack during runtime. Some examples:
make pushconfigs-all
Distributes configuration to all Swarm nodes
make hup-prometheus
Gently restarts Prometheus Server after a configuration change
make start
Equivalent to a `docker compose up` with cluster specific information
make start-all
Starts the stack and scales all required services
Controlling the Stack
54. These commands are run from the /company/compose/infra-prom-stack directory on the first Master Node
One or more cAdvisor containers are down
Restart via UCP
If that fails remove the stopped containers
Run `make scale-cadvisor` from /company/compose/infra-prom-stack
One or more node-exporter containers are down
Restart via UCP
If that fails remove the stopped containers
Run `make scale-node-exporter` from /company/compose/infra-prom-stack
Cannot connect to Prometheus Server, Grafana, or Alertmanager
Validate they are up via UCP
Occasionally HAProxy seems to get confused and needs a simple restart via UCP
Fixing Some Basic Problems
57. Infrastructure Monitoring and Logging services are currently
deployed as shared infrastructure services in a Docker Overlay
network.
Overlay name: infra_netmon
Monitoring stack
Logging stack
Network Overlay and Shared Services
58. Prometheus is Federated, enabling existing Prometheus Servers to monitor other Prometheus Servers.
north-nonprod monitors both
east
west
east monitors
west
west monitors
east
Basic synthetic monitoring
Federation
Who monitors the monitors?
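Mechanically, federation is one Prometheus Server scraping another server's /federate endpoint, with match[] parameters selecting which series to pull. A small Python sketch that builds those scrape URLs for the topology on this slide. The hostnames, port, and selector are hypothetical, and in practice this lives in prometheus.yml as a scrape job rather than in code.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Who monitors whom, taken from the slide.
TOPOLOGY = {
    "north-nonprod": ["east", "west"],
    "east": ["west"],
    "west": ["east"],
}


def federate_url(base_url, selectors):
    """URL a federating server would scrape; each selector becomes one match[] parameter."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


def scrape_urls(monitor, selectors):
    """Federation URLs for every peer that `monitor` watches (hostnames are made up)."""
    return [
        federate_url(f"http://prometheus.{peer}.example.com:9090", selectors)
        for peer in TOPOLOGY[monitor]
    ]


urls = scrape_urls("north-nonprod", ['{job="node"}'])
decoded = parse_qs(urlsplit(urls[0]).query)  # round-trip check of the encoding
print(urls)
print(decoded)
```

Because east and west watch each other, losing either server still leaves one that notices.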
59. If we stick with Prometheus then there are several improvements that will need exploration and engineering
Integrate configuration and deployment via a CI/CD pipeline
Improve and refine Rules/Alerts
Update Prometheus Server to latest version
Not much to gain here at the moment
Update Grafana to latest version
Some interesting new features including built in alerts
Back Grafana with a relational database
Enables persistent annotations
Engineer HA Prometheus and Alertmanager within a cluster
Figure out a better persistent storage strategy
This is bigger than Prometheus/Monitoring
Future Work
60. Since this is an Open Source solution we will have new tradeoffs vs. a fully vendored solution. The following resources are suggested for those
wanting to dive deeper into this technology stack.
See the Prometheus docs, GitHub repo, YouTube videos, and Robust Perception blog
https://prometheus.io/docs/introduction/overview/
https://github.com/prometheus/prometheus
https://www.youtube.com/watch?v=gNmWzkGViAY&t
https://www.robustperception.io/blog/
See the Grafana docs, GitHub repo, and Screencasts
http://docs.grafana.org/
https://github.com/grafana/grafana
https://www.youtube.com/playlist?list=PLDGkOdUX1Ujo3wHw9-z5Vo12YLqXRjzg2
See the cAdvisor GitHub repo
https://github.com/google/cadvisor
Want to Learn More?
61. Microservices are (intended to be) ephemeral
We need to monitor potentially transient services and act accordingly
This is an Open Source solution down the stack
Prometheus is targeted to replace existing on-prem roles
Capable of very basic synthetics
Can set up service level monitoring for mongodb, rabbitmq, etc(d).
Interface with 3rd party connectors
Alerts are easy to create and manage
Deployed as Infrastructure as Code
Embrace DevOps
Key Points