Prometheus Monitoring – Docker Enterprise Edition
Tim Tyler – Docker Captain
January 03, 2017
This is a training deck I originally developed in December 2016 and presented as part of a
company training plan for the Docker Enterprise Edition platform.
As of this edit it is 2019 and the deck is substantially dated: the whole tool stack has moved
forward significantly, as have some of the nuts and bolts originally described here (for
instance, it is now easy to drop HAProxy as described in favor of Interlock). I’ve decided to
share it anyway, as it retains some intrinsic value and can form the basis for an
updated and modernized version and a potential MeetUp talk.
I’ve removed about 10% of the original content that was company specific or proprietary,
leaving only publicly available detail, and obscured some data. Many of the images worked
better on a white background, and rather than fiddle too much with them I’ve just applied
some quick picture styles.
It would be very easy to base an updated tech stack on this document and install a portable
training system on a Raspberry Pi. I am currently building a Prometheus and Grafana based
system for monitoring, alerting, and visualizing my Samsung SmartThings home automation
on a spare iMac.
@timotyler
ttyler
3
Who’s Keeping An Eye On Your
Containers?
 Monitoring Stack Overview
 Prometheus
 Exporters
 Alertmanager
 Queries
 Alerts
Nuts and Bolts
Questions
4
Agenda
Monitoring Stack Overview
5
I'm still passionately interested in what my fellow humans are up to. For me, a day
spent monitoring the passing parade is a day well-spent. - Garry Trudeau
 Monitoring containerized and microservices environments presents new challenges.
 Containers can be highly ephemeral
 Microservices are able to scale up and down to meet design and performance criteria
 Microservices may exist for seconds, or persist indefinitely
 Microservices are generally a single process
 Containers live on hosts, but hosts are just pooled resources
 Generally we don’t think about what host an application microservice is running on
 Instances of a microservice may live on multiple hosts in a Docker Swarm
 Instances of a microservice may move to different hosts within a Docker Swarm
 The Swarm is a pool, and the microservices just swim in it
 Monitoring, like the microservice architecture, needs to be elastic
6
What’s the Problem?
 We have options, and several are readily available
 Prometheus
Time series dimensional data model with Docker aware agents
 Dynatrace
Specialist in application performance monitoring with Docker support
 SignalFX
Newer offering with native Docker support
 Sysdig
Swiss army knife for infrastructure and microservices monitoring
7
Can We Solve This?
 Prometheus is a Pitts S-2A RC muscle biplane
 Prometheus is a prequel and fifth installment in the Alien franchise
 Prometheus is a Greek Titan that gave us fire and suffered an unfortunate fate
involving a hungry eagle and his liver
 Prometheus is a leading Open Source monitoring solution
Prometheus is straightforward to implement as a primary cluster monitoring
stack
A complete stack can also include the Open Source data visualization tool
Grafana
8
PROMETHEUS!
What’s Prometheus?
 Dimensional Data Metric Collector
 Interactive Query Engine
 Calculator for discrete multidimensional data streams
 Great Visualization
 Efficient Storage
 Simple Operation
 Alerting
 Many Client Libraries
 Many Integrations
9
PROMETHEUS!
But what does it do?
 Prometheus Server
 Scrapes and stores time series data
 Alertmanager
 Handles alerts generated by Prometheus Server, deduplicating, grouping, and routing alerts to configured receivers
 AM-Exporter
 Receiver to transmit alerts from Alertmanager to custom intake process
 Exporters
 Agents with specific duties that collect metrics and present them to Prometheus Server
 cAdvisor, node-exporter, blackbox-exporter
 Grafana
 Data visualization
 HAProxy
 Routes calls to Prometheus Server, Alertmanager, and Grafana within the Docker overlay network
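The Alertmanager duties listed above (deduplicating, grouping, routing) can be pictured with a small sketch. This is not Alertmanager's implementation, only an illustration of the idea. The alert dictionaries, the `group_by` list, and the `default` receiver name are assumptions made up for the example; `ChatBot` is the receiver name that appears later in this deck.

```python
from collections import defaultdict

# Hypothetical alerts, each identified by its label set.
ALERTS = [
    {"labels": {"alertname": "NodeDown", "instance": "node1", "severity": "critical"}},
    {"labels": {"alertname": "NodeDown", "instance": "node1", "severity": "critical"}},  # duplicate
    {"labels": {"alertname": "NodeDown", "instance": "node2", "severity": "critical"}},
    {"labels": {"alertname": "DiskFull", "instance": "node1", "severity": "warning"}},
]

# Hypothetical routing table: (labels that must match, receiver name).
ROUTES = [({"severity": "critical"}, "ChatBot")]


def dedupe(alerts):
    """Drop alerts whose full label set has already been seen."""
    seen = set()
    unique = []
    for alert in alerts:
        key = frozenset(alert["labels"].items())
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def group(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def route(labels, routes, default_receiver):
    """Return the first receiver whose matcher labels all match."""
    for matchers, receiver in routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default_receiver
```

Grouping by `alertname` here would send one NodeDown notification covering both nodes instead of one message per node, which is the point of grouping.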
10
A typical Prometheus Monitoring Stack
11
Prometheus Architecture
12
Prometheus Implementation
 Infrastructure as Code
 IaC means treating the configuration of systems the same way that software code is treated
 We’re all devs now
 Automate and modularize
 Apply test pyramid
 Version control changes, patches, and releases
 Share work! (Because DevOps)
 Installed via Docker orchestration and some basic automation
 Makefile driven
 Apply environment specific customizations (hostnames, passwords, alerts, etc.) to config files
 Deploy configs across cluster
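As a rough illustration of the "apply environment specific customizations" step, here is a sketch that renders a Prometheus config from a template. The deck's actual Makefile is not shown, so this is a stand-in written in Python. The template contents, the `cluster` label, and the host names are assumptions; port 9100 is the usual node-exporter default.

```python
from string import Template

# Hypothetical template; a real stack would keep this in version control.
PROMETHEUS_YML = Template("""\
global:
  scrape_interval: 15s
  external_labels:
    cluster: $cluster
scrape_configs:
  - job_name: node
    static_configs:
      - targets: [$targets]
""")


def render_config(cluster, hosts, port=9100):
    """Fill in the cluster name and the node-exporter targets."""
    targets = ", ".join(f"'{host}:{port}'" for host in hosts)
    return PROMETHEUS_YML.substitute(cluster=cluster, targets=targets)
```

Calling `render_config("east", ["host1", "host2"])` returns the YAML text for one cluster; the same template can then be rendered once per environment and pushed to the nodes.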
13
Stack Installation
Prometheus
14
15
Why should the thirst for knowledge be
aroused, only to be disappointed and
punished? Yet, like a second
Prometheus, I will endure this and worse
- Edwin Abbott in Flatland: A Romance of Many
Dimensions (1884)
 Open Source systems monitoring and alerting tool originally
built at SoundCloud
 Very active developer and user community
 Docs and stuff
https://prometheus.io/docs/introduction/overview/
16
Prometheus Server
 Collect and store time series data
 Scrape defined targets for functionally specific data
 Discover targets statically or dynamically
 Evaluate rulesets
 Allow vector arithmetic
 Send alerts
17
What can Prometheus Server Do?
18
Prometheus Has a Really Boring UI
We’ll go poke around for a minute
Exporters
19
Prometheus Exporters are essentially agents that collect
application-specific time series metrics and expose them on an
HTTP endpoint for Prometheus to scrape.
20
What Are Exporters?
Prometheus has support either directly, or via third parties, for
dozens of exporters. Some tools have been directly instrumented
to provide a Prometheus endpoint such as etcd, cAdvisor,
Kubernetes, and Docker.
Custom, business-specific exporters can be written in any
language; Go is a popular choice.
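To make the idea of a custom exporter concrete, here is a minimal sketch using only the Python standard library. It serves one gauge in the Prometheus text exposition format at `/metrics`. The metric name `orders_pending`, its label, its value, and port 9999 are all made up for illustration; a real exporter would normally use an official client library rather than hand-rendering the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, help, labels, value) gauges in the text exposition format."""
    lines = []
    for name, help_text, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    # Hypothetical business metric; a real exporter would read it from the application.
    return [("orders_pending", "Orders waiting to be processed.", {"queue": "default"}, 7)]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` and pointing a scrape target at that port would be enough for Prometheus Server to start collecting the metric.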
21
A Bunch of Exporters
A basic Docker monitoring stack implements 3 exporters
 cAdvisor
 Provides metrics on docker and container environment
 node-exporter
 exporter for hardware and OS metrics exposed by the kernel
 blackbox-exporter
 allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP and ICMP.
22
What is a Minimal Set of Exporters?
 Pushgateway
 allow ephemeral and batch jobs to expose their metrics
 HAProxy-exporter
periodically scrapes HAProxy stats and exports them via HTTP for Prometheus
 JMX-exporter
configurably scrape and expose MBeans of a JMX target
 Mongodb-exporter
 Rabbitmq-exporter
23
What are Some Other Exporters?
Alertmanager
24
25
Alertmanager
26
Alertmanager
Queries
27
Prometheus provides a functional query language (PromQL) that lets the user select and
aggregate time series data in real time. Results can be rendered as
follows:
 Displayed in a graph
 Viewed as tabular data
 Consumed by external systems
Grafana for instance
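For the "consumed by external systems" case, a query can be sent to the Prometheus HTTP API. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint and, optionally, fetches it. The base URL `http://localhost:9090` used in the usage note is an assumption about where the server listens, and `run_query` needs a reachable server to work.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, expr, time=None):
    """Build the URL for an instant query against the HTTP API."""
    params = {"query": expr}
    if time is not None:
        params["time"] = time
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def run_query(base_url, expr):
    """Send the query and return the `data` part of a successful response."""
    with urlopen(instant_query_url(base_url, expr)) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return payload["data"]
```

For example, `run_query("http://localhost:9090", 'up{job="node"}')` would return the current `up` samples for the node job, assuming a server is listening there.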
28
The Basics
 Instant vector
A set of time series data containing a single sample for each series
 Range vector
A set of time series data containing a range of data points over time
 Scalar
A simple numeric floating point value
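One way to see the difference between these types is to look at how the Prometheus HTTP API returns them: an instant vector comes back with `resultType` of `vector` (one `value` per series), a range vector as `matrix` (a list of `values` per series), and a scalar as `scalar`. The sketch below decodes each shape; the sample payloads in the usage note are hand-written for illustration, not captured from a real server.

```python
def summarize(data):
    """Decode the `data` object of an HTTP API query response."""
    kind = data["resultType"]
    result = data["result"]
    if kind == "vector":
        # Instant vector: a single [timestamp, "value"] pair per series.
        return {
            tuple(sorted(series["metric"].items())): float(series["value"][1])
            for series in result
        }
    if kind == "matrix":
        # Range vector: a list of [timestamp, "value"] pairs per series.
        return {
            tuple(sorted(series["metric"].items())): [float(v) for _, v in series["values"]]
            for series in result
        }
    if kind == "scalar":
        # Scalar: one [timestamp, "value"] pair with no labels.
        return float(result[1])
    raise ValueError(f"unhandled result type: {kind}")
```

A hand-written instant-vector payload such as `{"resultType": "vector", "result": [{"metric": {"job": "node"}, "value": [1483401600, "1"]}]}` decodes to one float per label set, while the matching `matrix` payload decodes to a list of floats per label set.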
29
Data Types
Prometheus has 3 basic data types
30
Operators
Prometheus supports basic logical and arithmetic operators
Arithmetic Operators
+ (addition)
- (subtraction)
* (multiplication)
/ (division)
% (modulo)
^ (exponentiation)
Comparison Operators
== (equal)
!= (not equal)
> (greater than)
< (less than)
>= (greater or equal)
<= (less or equal)
Aggregation Operators
sum
min
max
avg
count
topk
 sum
 count
 irate
 sort
 topk
 time
31
Functions
Prometheus supports about 40 built in functions
32
Simple Query
What's up?
sort_desc(
  topk(5,
    sum by (image) (
      irate(container_cpu_usage_seconds_total{id=~"/docker/.*"}[5m])
    )
  )
)
33
Fancy Query
Top 5 Docker Images by CPU
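To unpack what the "Top 5 Docker Images by CPU" query does, here is a rough simulation of each step on in-memory samples: filter series by the `id` regex, take a per-second rate from the last two samples of each counter, sum by `image`, keep the top k, and sort descending. This is a simplified model, not how Prometheus evaluates queries; the sample data in the usage note and the counter-reset handling are assumptions.

```python
import re
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples of a counter."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    # Assumption: a drop in value means the counter was reset to zero.
    delta = v1 if v1 < v0 else v1 - v0
    return delta / (t1 - t0)


def top_images_by_cpu(series, k=5, id_pattern=r"/docker/.*"):
    """series is a list of (labels, samples); returns [(image, rate), ...]."""
    totals = defaultdict(float)
    for labels, samples in series:
        # Regex label matchers are fully anchored, hence fullmatch.
        if not re.fullmatch(id_pattern, labels.get("id", "")):
            continue
        rate = irate(samples)
        if rate is not None:
            totals[labels["image"]] += rate  # sum by (image)
    # topk + sort_desc collapse into one descending sort and a slice.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Given two `redis` containers and one `mongo` container under `/docker/`, the function returns one CPU rate per image, largest first, and ignores any series whose `id` is outside `/docker/`.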
Alerts
34
35
Big things have small beginnings –
David, from the movie Prometheus
(2012)
Let's build an Alert!
 Alerts are just queries with comparison operators
 Alerts are written in a simple format in a plain text file
 Alerts can be decorated with interesting metadata
 Alert metadata can be templated
 Alerts can be sent to an external service
36
First Things First
37
The Anatomy of an Alert
An alert starts with a Query – like up
38
The Anatomy of an Alert
This is more info than we want though
39
The Anatomy of an Alert
What we really want is to count how many we have
40
The Anatomy of an Alert
Or change how we count them
41
The Anatomy of an Alert
And do some math
42
The Anatomy of an Alert
Check out a quick chart
43
The Anatomy of an Alert
This is more fun
ALERT NodeDown
  IF up{job="node"} == 0
  FOR 1m
  LABELS {prdcode="0000", host="Shared_Infra", severity="critical", support="Prometheus_Critical"}
  ANNOTATIONS {
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minute.",
    rosguide="Please see Application ROS guide",
    summary="Instance {{$labels.instance}} down"
  }
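The `FOR 1m` clause and the `{{$labels...}}` templating in the NodeDown rule can be sketched as follows. The condition must hold continuously for the `FOR` duration before the alert moves from pending to firing. This is a simplified model written for illustration, not Prometheus code; the sample evaluations and the label values in the usage note are assumptions, and `expand` handles only the `$labels` placeholder, a tiny subset of the real template language.

```python
import re


def alert_state(up_samples, for_seconds):
    """State of the rule `up == 0 FOR <for_seconds>` after the last evaluation.

    up_samples is a time-ordered list of (timestamp, value) evaluations.
    """
    active_since = None
    for ts, value in up_samples:
        if value == 0:
            if active_since is None:
                active_since = ts  # condition just became true
        else:
            active_since = None  # condition cleared, start over
    if active_since is None:
        return "inactive"
    held = up_samples[-1][0] - active_since
    return "firing" if held >= for_seconds else "pending"


def expand(template, labels):
    """Replace {{$labels.name}} placeholders with label values."""
    return re.sub(
        r"\{\{\s*\$labels\.(\w+)\s*\}\}",
        lambda m: labels.get(m.group(1), ""),
        template,
    )
```

With evaluations at 0, 30 and 60 seconds that all see `up == 0`, `alert_state(..., 60)` reports `firing`; a single recovery in between resets the clock. `expand("Instance {{$labels.instance}} down", {"instance": "node1:9100"})` yields the summary text for that instance.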
44
The Anatomy of an Alert
And go back and turn our earlier query into an alert
 OptimusPrime (bot) 3:32 PM
 AlertManager message: [FIRING:1] NodeDown (0000 prod node.metrics Shared_Infra node app critical Prometheus_Critical). Learn more at https://somewhere.dockeralerts.company.com:8443/#/alerts?receiver=ChatBot
45
A NodeDown Alert Sent To Chat
Fate rarely calls on us at a moment of our choosing – Optimus Prime
 Rules/Alerts are segregated into functionally specific rule files
 alert.rules
 basic alert installed with 1 rule ‘IF up{job="node"} == 0’
 alert.infra.logging.rules
 Logging ruleset
 alert.infra.monitoring.rules
 Monitoring stack rules
 alert.infra.rules
 Basic infrastructure rules such as file systems, memory, and thinpool
 alert.service.app.prod.rules
 Service level rules such as redis, mongodb, rabbitmq, etc.
 alert.docker.rules
 Rules for Docker itself
 alert.0000.app.rules
 Application specific rules
46
How are Rules/Alerts Categorized?
Grafana
47
 Grafana is a leading Open Source Data Visualization Tool
 Create and share intuitive dashboards
 Rich graphing and charting
 Mixed styling within a dashboard
 Dashboard templates
 Lots of additional features
48
What is Grafana?
49
Nuts and Bolts
50
The Infrastructure Monitoring Stack is currently considered v1.0
 Prometheus v1.3.1
 Grafana v3.1.1
 Alertmanager v0.4.2 custom-v2
 HAProxy v1.6.9
 cAdvisor v0.24.1
 Node-exporter v0.12.0
 Blackbox-exporter v0.2.0
51
What's in Your Stack?
We use Git to manage configurations and changes to the tech stack. Git is a distributed
version control system.
 Simple to use
 Enables code collaboration
 Eases deployments
 https://somewhere.company.com/git/projects/PRJ0000/repos/infra-prom-stack/browse
52
Tech Stack SOA
The Monitoring Stack is deployed and configured from one location in each Docker Swarm, typically the first Docker
Master Node.
 Configuration files
 /company/compose/infra-prom-stack
 Prometheus Configuration
 /company/compose/infra-prom-stack/infra/prometheus/config/prometheus.yml
 Alertmanager Configuration
 /company/compose/infra-prom-stack/infra/prometheus/config/alertmanager.conf
 Alert Files
 /company/compose/infra-prom-stack/infra/prometheus/alerts
53
Basic Stack Deployment
The Makefile simplifies stack management by reducing error-prone commands to simple make targets. It is used both to
configure and install the Monitoring Stack and to manage the stack during runtime. Some examples:
 make pushconfigs-all
 Distributes configuration to all Swarm nodes
 make hup-prometheus
 Gently restarts Prometheus Server after a configuration change
 make start
 Equivalent to a `docker compose up` with cluster specific information
 make start-all
 Starts the stack and scales all required services
54
Controlling the Stack
These commands are run from /company/compose/infra-prom-stack on the first Master Node
 One or more cAdvisor containers are down
 Restart via UCP
 If that fails remove the stopped containers
 Run `make scale-cadvisor` from /company/compose/infra-prom-stack
 One or more node-exporter containers are down
 Restart via UCP
 If that fails remove the stopped containers
 Run `make scale-node-exporter` from /company/compose/infra-prom-stack
 Cannot connect to Prometheus Server, Grafana, or Alertmanager
 Validate they are up via UCP
 Occasionally HAProxy seems to get confused and needs a simple restart via UCP
55
Fixing Some Basic Problems
56
Prometheus UCP View
57
Prometheus UCP View
Infrastructure Monitoring and Logging services are currently
deployed as shared infrastructure services in a Docker Overlay
network.
 Overlay name: infra_netmon
Monitoring stack
Logging stack
58
Network Overlay and Shared Services
Prometheus is Federated, enabling existing Prometheus Servers to monitor other Prometheus Servers.
 north-nonprod monitors both
 east
 west
 east monitors
 west
 west monitors
 east
 Basic synthetic monitoring
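Federation works by one Prometheus Server scraping another server's `/federate` endpoint, passing one or more `match[]` selectors to choose which series to pull. The sketch below only builds such a URL; the host name `west` echoes the cluster names above, but the full address, port 9090, and the selector are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"
```

For instance, `federate_url("http://west:9090", ['{job="node"}'])` gives the address the east server would scrape to pull west's node metrics.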
59
Federation
Who monitors the monitors?
If we stick with Prometheus then there are several improvements that will need exploration and engineering
 Integrate configuration and deployment via a CI/CD pipeline
 Improve and refine Rules/Alerts
 Update Prometheus Server to latest version
 Not much to gain here at the moment
 Update Grafana to latest version
 Some interesting new features including built in alerts
 Back Grafana with a relational database
 Enables persistent annotations
 Engineer HA Prometheus and Alertmanager within a cluster
 Figure out a better persistent storage strategy
 This is bigger than Prometheus/Monitoring
60
Future Work
Since this is an Open Source solution we will have new tradeoffs vs. a fully vendored solution. The following resources are suggested for those
wanting to dive deeper into this technology stack.
 See the Prometheus docs, GitHub repo, YouTube videos, and Robust Perception blog
 https://prometheus.io/docs/introduction/overview/
 https://github.com/prometheus/prometheus
 https://www.youtube.com/watch?v=gNmWzkGViAY&t
 https://www.robustperception.io/blog/
 See the Grafana docs, GitHub repo, and Screencasts
 http://docs.grafana.org/
 https://github.com/grafana/grafana
 https://www.youtube.com/playlist?list=PLDGkOdUX1Ujo3wHw9-z5Vo12YLqXRjzg2
 See the cAdvisor GitHub repo
 https://github.com/google/cadvisor
61
Want to Learn More?
 Microservices are (intended to be) ephemeral
 We need to monitor potentially transient services and act accordingly
 This is an Open Source solution down the stack
 Prometheus is targeted to replace existing on-prem roles
Capable of very basic synthetics
Can set up service level monitoring for mongodb, rabbitmq, etc(d).
 Interface with 3rd party connectors
 Alerts are easy to create and manage
 Deployed as Infrastructure as Code
Embrace DevOps
62
Key Points
Questions, Maybe Answers
63
64
I Hope This Isn’t You Right Now

Weitere ähnliche Inhalte

Was ist angesagt?

Monitoring microservices with Prometheus
Monitoring microservices with PrometheusMonitoring microservices with Prometheus
Monitoring microservices with PrometheusTobias Schmidt
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOSAkihiro Suda
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)Lucas Jellema
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With PrometheusKnoldus Inc.
 
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...Amazon Web Services
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with PrometheusShiao-An Yuan
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
Présentation docker et kubernetes
Présentation docker et kubernetesPrésentation docker et kubernetes
Présentation docker et kubernetesKiwi Backup
 
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeTerraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeMartin Schütte
 
Ansible presentation
Ansible presentationAnsible presentation
Ansible presentationSuresh Kumar
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
An introduction to terraform
An introduction to terraformAn introduction to terraform
An introduction to terraformJulien Pivotto
 

Was ist angesagt? (20)

Monitoring microservices with Prometheus
Monitoring microservices with PrometheusMonitoring microservices with Prometheus
Monitoring microservices with Prometheus
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS
 
MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
02 terraform core concepts
02 terraform core concepts02 terraform core concepts
02 terraform core concepts
 
Prometheus and Grafana
Prometheus and GrafanaPrometheus and Grafana
Prometheus and Grafana
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
Using HashiCorp’s Terraform to build your infrastructure on AWS - Pop-up Loft...
 
Nagios intro
Nagios intro Nagios intro
Nagios intro
 
Terraform
TerraformTerraform
Terraform
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Présentation docker et kubernetes
Présentation docker et kubernetesPrésentation docker et kubernetes
Présentation docker et kubernetes
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Terraform -- Infrastructure as Code
Terraform -- Infrastructure as CodeTerraform -- Infrastructure as Code
Terraform -- Infrastructure as Code
 
Ansible presentation
Ansible presentationAnsible presentation
Ansible presentation
 
Github in Action
Github in ActionGithub in Action
Github in Action
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Cloud Monitoring tool Grafana
Cloud Monitoring  tool Grafana Cloud Monitoring  tool Grafana
Cloud Monitoring tool Grafana
 
An introduction to terraform
An introduction to terraformAn introduction to terraform
An introduction to terraform
 

Ähnlich wie Prometheus Monitoring of Docker Containers

Weave User Group Talk - DockerCon 2017 Recap
Weave User Group Talk - DockerCon 2017 RecapWeave User Group Talk - DockerCon 2017 Recap
Weave User Group Talk - DockerCon 2017 RecapPatrick Chanezon
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
The DevOps paradigm - the evolution of IT professionals and opensource toolkit
The DevOps paradigm - the evolution of IT professionals and opensource toolkitThe DevOps paradigm - the evolution of IT professionals and opensource toolkit
The DevOps paradigm - the evolution of IT professionals and opensource toolkitMarco Ferrigno
 
The DevOps Paradigm
The DevOps ParadigmThe DevOps Paradigm
The DevOps ParadigmNaLUG
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source Nitesh Jadhav
 
Monitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with PrometheusMonitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with PrometheusJacopo Nardiello
 
PeopleSoft Cloud Architecture - OpenWorld 2016
PeopleSoft Cloud Architecture - OpenWorld 2016PeopleSoft Cloud Architecture - OpenWorld 2016
PeopleSoft Cloud Architecture - OpenWorld 2016Graham Smith
 
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)QAware GmbH
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Building Your Docker Tech Stack
Building Your Docker Tech StackBuilding Your Docker Tech Stack
Building Your Docker Tech StackBret Fisher
 
Building your production tech stack for docker container platform
Building your production tech stack for docker container platformBuilding your production tech stack for docker container platform
Building your production tech stack for docker container platformDocker, Inc.
 
Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...
Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...
Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...Patrick Chanezon
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
Programming the world with Docker
Programming the world with DockerProgramming the world with Docker
Programming the world with DockerPatrick Chanezon
 
From nothing to Prometheus : one year after
From nothing to Prometheus : one year afterFrom nothing to Prometheus : one year after
From nothing to Prometheus : one year afterAntoine Leroyer
 
Open Source XMPP for Cloud Services
Open Source XMPP for Cloud ServicesOpen Source XMPP for Cloud Services
Open Source XMPP for Cloud Servicesmattjive
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
What's New in Docker - February 2017
What's New in Docker - February 2017What's New in Docker - February 2017
What's New in Docker - February 2017Patrick Chanezon
 

Ähnlich wie Prometheus Monitoring of Docker Containers (20)

Weave User Group Talk - DockerCon 2017 Recap
Weave User Group Talk - DockerCon 2017 RecapWeave User Group Talk - DockerCon 2017 Recap
Weave User Group Talk - DockerCon 2017 Recap
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
The DevOps paradigm - the evolution of IT professionals and opensource toolkit
The DevOps paradigm - the evolution of IT professionals and opensource toolkitThe DevOps paradigm - the evolution of IT professionals and opensource toolkit
The DevOps paradigm - the evolution of IT professionals and opensource toolkit
 
The DevOps Paradigm
The DevOps ParadigmThe DevOps Paradigm
The DevOps Paradigm
 
Build cloud native solution using open source
Build cloud native solution using open source Build cloud native solution using open source
Build cloud native solution using open source
 
Monitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with PrometheusMonitoring Cloud Native Applications with Prometheus
Monitoring Cloud Native Applications with Prometheus
 
PeopleSoft Cloud Architecture - OpenWorld 2016
PeopleSoft Cloud Architecture - OpenWorld 2016PeopleSoft Cloud Architecture - OpenWorld 2016
PeopleSoft Cloud Architecture - OpenWorld 2016
 
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
Kubernetes One-Click Deployment: Hands-on Workshop (Mainz)
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Building Your Docker Tech Stack
Building Your Docker Tech StackBuilding Your Docker Tech Stack
Building Your Docker Tech Stack
 
Building your production tech stack for docker container platform
Building your production tech stack for docker container platformBuilding your production tech stack for docker container platform
Building your production tech stack for docker container platform
 
Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...
Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...
Docker Azure Friday OSS March 2017 - Developing and deploying Java & Linux on...
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
Docker 101 Checonf 2016
Docker 101 Checonf 2016Docker 101 Checonf 2016
Docker 101 Checonf 2016
 
Programming the world with Docker
Programming the world with DockerProgramming the world with Docker
Programming the world with Docker
 
From nothing to Prometheus : one year after
From nothing to Prometheus : one year afterFrom nothing to Prometheus : one year after
From nothing to Prometheus : one year after
 
Open Source XMPP for Cloud Services
Open Source XMPP for Cloud ServicesOpen Source XMPP for Cloud Services
Open Source XMPP for Cloud Services
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
What's New in Docker - February 2017
What's New in Docker - February 2017What's New in Docker - February 2017
What's New in Docker - February 2017
 

Kürzlich hochgeladen

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Prometheus Monitoring of Docker Containers

  • 1. Prometheus Monitoring – Docker Enterprise Edition. Tim Tyler, Docker Captain. January 03, 2017.
      This is a training deck I originally developed in December 2016 and presented as part of a company training plan for the Docker Enterprise Edition platform.
      As of this edit it is 2019 and the deck is substantially dated: the whole tool stack has moved forward significantly, as have some of the nuts and bolts originally described here (for instance, it is now very easy to drop HAProxy, as described here, in favor of Interlock). I've decided to share it because it still has some intrinsic value and can form the basis for an updated and modernized version and a potential MeetUp talk.
      I've removed about 10% of the original content that was company specific or proprietary, leaving only publicly available detail, and obscured some data. Many of the images worked better on a white background, and rather than fiddle too much with them I've just applied some quick picture styles.
      It would be very easy to base an updated tech stack on this document and install a portable training system on a Raspberry Pi. I am currently building a Prometheus and Grafana based system for monitoring, alerting, and visualizing my Samsung SmartThings home automation on a spare iMac.
      @timotyler ttyler
  • 2. Who's Keeping An Eye On Your Containers?
  • 3. Agenda
      • Monitoring Stack Overview
      • Prometheus
      • Exporters
      • Alertmanager
      • Queries
      • Alerts
      • Nuts and Bolts
      • Questions
  • 4. Monitoring Stack Overview
      "I'm still passionately interested in what my fellow humans are up to. For me, a day spent monitoring the passing parade is a day well-spent." - Garry Trudeau
  • 5. What's the Problem?
      • Monitoring containerized and microservices environments presents new challenges.
      • Containers can be highly ephemeral
      • Microservices are able to scale up and down to meet design and performance criteria
      • Microservices may exist for seconds, or persist indefinitely
      • Microservices are generally a single process
      • Containers live on hosts, but hosts are just pooled resources
      • Generally we don't think about what host an application microservice is running on
      • Instances of a microservice may live on multiple hosts in a Docker Swarm
      • Instances of a microservice may move to different hosts within a Docker Swarm
      • The Swarm is a pool, and the microservices just swim in it
      • Monitoring, like the microservice architecture, needs to be elastic
  • 6. Can We Solve This?
      • We have options, and several are readily available
      • Prometheus: time series dimensional data model with Docker-aware agents
      • Dynatrace: specialist in application performance monitoring with Docker support
      • SignalFX: newer offering with native Docker support
      • Sysdig: Swiss army knife for infrastructure and microservices monitoring
  • 7. PROMETHEUS! What's Prometheus?
      • Prometheus is a Pitts S-2A RC muscle biplane
      • Prometheus is a prequel and fifth installment in the Alien franchise
      • Prometheus is a Greek Titan who gave us fire and suffered an unfortunate fate involving a hungry eagle and his liver
      • Prometheus is a leading Open Source monitoring solution
          • Prometheus is straightforward to implement as a primary cluster monitoring stack
          • A complete stack can also include the Open Source data visualization tool Grafana
  • 8. PROMETHEUS! But what does it do?
      • Dimensional Data Metric Collector
      • Interactive Query Engine
      • Calculator for discrete multidimensional data streams
      • Great Visualization
      • Efficient Storage
      • Simple Operation
      • Alerting
      • Many Client Libraries
      • Many Integrations
  • 9. A typical Prometheus Monitoring Stack
      • Prometheus Server
          • Scrapes and stores time series data
      • Alertmanager
          • Handles alerts generated by Prometheus Server, deduplicating, grouping, and routing alerts to configured receivers
      • AM-Exporter
          • Receiver to transmit alerts from Alertmanager to a custom intake process
      • Exporters
          • Agents with specific duties that collect metrics and present them to Prometheus Server
          • cAdvisor, node-exporter, blackbox-exporter
      • Grafana
          • Data visualization
      • HAProxy
          • Routes calls to Prometheus Server, Alertmanager, and Grafana within the Docker overlay network
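To make "deduplicating and grouping" a little more concrete, here is a minimal Python sketch of the idea. It is not Alertmanager's implementation: the function name `group_alerts` and the sample labels are hypothetical, and the sketch only assumes that an alert is identified by its full label set and that groups are formed from a chosen list of label names (comparable in spirit to the `group_by` setting in an Alertmanager route).

```python
def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest by the group_by labels.

    alerts   -- list of dicts mapping label name to label value
    group_by -- list of label names that define a group
    """
    seen = set()
    groups = {}
    for labels in alerts:
        # Two alerts with exactly the same labels are treated as one alert.
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # The group key is the value of each group_by label (empty if absent).
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups.setdefault(key, []).append(labels)
    return groups


alerts = [
    {"alertname": "NodeDown", "instance": "node1", "severity": "critical"},
    {"alertname": "NodeDown", "instance": "node1", "severity": "critical"},  # duplicate
    {"alertname": "NodeDown", "instance": "node2", "severity": "critical"},
    {"alertname": "DiskFull", "instance": "node1", "severity": "warning"},
]
groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    print(key, len(members))
```

Each group would then be sent as a single notification to whatever receiver the routing rules select, which is what keeps a cluster-wide outage from producing one message per node.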
  • 12. Stack Installation
      • Infrastructure as Code
          • IaC treats the configuration of systems the same way that software code is treated
          • We're all devs now
          • Automate and modularize
          • Apply test pyramid
          • Version control changes, patches, and releases
          • Share work! (Because DevOps)
      • Installed via Docker orchestration and some basic automation
          • Makefile driven
          • Apply environment specific customizations (hostnames, passwords, alerts, etc.) to config files
          • Deploy configs across cluster
  • 14. "Why should the thirst for knowledge be aroused, only to be disappointed and punished? Yet, like a second Prometheus, I will endure this and worse" - Edwin Abbott in Flatland: A Romance of Many Dimensions (1884)
  • 15. Prometheus Server
      • Open Source systems monitoring and alerting tool originally built at SoundCloud
      • Very active developer and user community
      • Docs and stuff: https://prometheus.io/docs/introduction/overview/
  • 16. What can Prometheus Server Do?
      • Collect and store time series data
      • Scrape defined targets for functionally specific data
      • Discover targets statically or dynamically
      • Evaluate rulesets
      • Allow vector arithmetic
      • Send alerts
  • 17. Prometheus Has a Really Boring UI: We'll go poke around for a minute
  • 19. What Are Exporters?
      Prometheus Exporters are basically agents that are responsible for collecting application-specific time series metrics and presenting them via an API endpoint for Prometheus to collect.
  • 20. A Bunch of Exporters
      Prometheus has support, either directly or via third parties, for dozens of exporters. Some tools, such as etcd, cAdvisor, Kubernetes, and Docker, have been directly instrumented to provide a Prometheus endpoint. Custom, business-specific exporters can be easily written in any language; Go seems popular.
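As a rough illustration of how small a custom exporter can be, the sketch below renders metrics in the Prometheus text exposition format and wraps them in an HTTP handler, using only the Python standard library. The metric name `demo_queue_depth`, its labels, and the helpers `render_metrics` and `MetricsHandler` are hypothetical; a production exporter would more likely be built on one of the official client libraries.

```python
from http.server import BaseHTTPRequestHandler


def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(name, help_text, metric_type, samples):
    """Render one metric family as text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on every GET (hypothetical, fixed data)."""

    def do_GET(self):
        body = render_metrics(
            "demo_queue_depth", "Items waiting in the queue.", "gauge",
            [({"queue": "orders"}, 7)],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(render_metrics("demo_queue_depth", "Items waiting in the queue.", "gauge",
                     [({"queue": "orders"}, 7)]), end="")
```

Passing `MetricsHandler` to `http.server.HTTPServer` on a port of your choosing would give Prometheus something to scrape; the server is deliberately not started here.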
  • 21. What is a Minimal Set of Exporters?
      A basic Docker monitoring stack implements 3 exporters:
      • cAdvisor
          • Provides metrics on the Docker and container environment
      • node-exporter
          • Exporter for hardware and OS metrics exposed by the kernel
      • blackbox-exporter
          • Allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP and ICMP
  • 22. What are Some Other Exporters?
      • Pushgateway
          • Allows ephemeral and batch jobs to expose their metrics
      • HAProxy-exporter
          • Periodically scrapes HAProxy stats and exports them via HTTP/JSON for Prometheus
      • JMX-exporter
          • Configurably scrapes and exposes mBeans of a JMX target
      • Mongodb-exporter
      • Rabbitmq-exporter
  • 27. The Basics
      Prometheus provides a functional language that lets the user select and aggregate time series data in real time. Results can be rendered as follows:
      • Displayed in a graph
      • Viewed as tabular data
      • Consumed by external systems, Grafana for instance
  • 28. Data Types: Prometheus has 3 basic data types
      • Instant vector: a set of time series containing a single sample for each series
      • Range vector: a set of time series containing a range of data points over time for each series
      • Scalar: a simple numeric floating point value
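The difference between an instant vector and a range vector is easier to see with data in hand. The following Python sketch models a tiny in-memory store and selects from it both ways. The store layout and the function names `instant_vector` and `range_vector` are assumptions made for illustration, and the sketch ignores details of the real engine such as staleness handling.

```python
# Each series is identified by its sorted label pairs and holds
# (unix_timestamp, value) samples in time order.
store = {
    (("instance", "node1"), ("job", "node")): [(0, 1.0), (15, 1.0), (30, 0.0)],
    (("instance", "node2"), ("job", "node")): [(0, 1.0), (15, 1.0), (30, 1.0)],
}


def instant_vector(store, at):
    """One sample per series: the latest sample at or before `at`."""
    result = {}
    for labels, samples in store.items():
        eligible = [s for s in samples if s[0] <= at]
        if eligible:
            result[labels] = eligible[-1]
    return result


def range_vector(store, at, window):
    """A list of samples per series: everything within `window` seconds before `at`."""
    result = {}
    for labels, samples in store.items():
        in_window = [s for s in samples if at - window <= s[0] <= at]
        if in_window:
            result[labels] = in_window
    return result


node1 = (("instance", "node1"), ("job", "node"))
print(instant_vector(store, 30)[node1])    # a single (timestamp, value) pair
print(range_vector(store, 30, 15)[node1])  # a list of pairs
scalar = 42.0                              # a scalar is just one float
```

In query terms, a bare selector such as `up` yields the first shape and `up[5m]` yields the second.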
  • 29. Operators: Prometheus supports basic logical and arithmetic operators
      • Arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (exponentiation)
      • Comparison operators: == (equal), != (not equal), > (greater than), < (less than), >= (greater or equal), <= (less or equal)
      • Aggregation operators: sum, min, max, avg, count, topk
  • 30. Functions: Prometheus supports about 40 built in functions
      • sum
      • count
      • irate
      • sort
      • topk
      • time
  • 32. Fancy Query: Top 5 Docker Images by CPU
      sort_desc(topk(5, sum by (image) (irate(container_cpu_usage_seconds_total{id=~"/docker/.*"}[5m]))))
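Reading that query from the inside out: filter the CPU counter to Docker containers, take the per-second rate from the last two samples of each series, add the rates up per image, keep the five largest, and sort them in descending order. The Python sketch below walks through the same steps on made-up data. The function names and sample values are hypothetical, and this is an approximation of the semantics rather than the engine's actual code.

```python
import re


def irate(samples):
    """Per-second rate from the last two samples of a counter, or None if unavailable."""
    if len(samples) < 2:
        return None
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    if t2 <= t1:
        return None
    # If the counter went backwards it was reset, so count from zero.
    delta = v2 if v2 < v1 else v2 - v1
    return delta / (t2 - t1)


def top_images_by_cpu(series, k, id_pattern=r"/docker/.*"):
    """series is a list of (labels_dict, samples) pairs; returns [(image, rate), ...]."""
    totals = {}
    for labels, samples in series:
        # The =~ matcher is anchored, so the whole label value must match.
        if not re.fullmatch(id_pattern, labels.get("id", "")):
            continue
        rate = irate(samples)
        if rate is None:
            continue
        image = labels.get("image", "")
        totals[image] = totals.get(image, 0.0) + rate        # sum by (image)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]                                         # topk + sort_desc


series = [
    ({"image": "web", "id": "/docker/a"}, [(0, 10.0), (10, 15.0)]),
    ({"image": "web", "id": "/docker/b"}, [(0, 5.0), (10, 7.5)]),
    ({"image": "db", "id": "/docker/c"}, [(0, 1.0), (10, 6.0)]),
    ({"image": "sys", "id": "/system.slice/x"}, [(0, 1.0), (10, 99.0)]),
]
print(top_images_by_cpu(series, 5))
```

The `sys` series has by far the largest rate but never appears in the result, because its `id` does not match the `/docker/.*` filter.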
  • 34. Let's build an Alert!
      "Big things have small beginnings" - David, from the movie Prometheus (2012)
  • 35. First Things First
      • Alerts are just queries with comparison operators
      • Alerts are written in a simple format in a plain text file
      • Alerts can be decorated with interesting metadata
      • Alert metadata can be templated
      • Alerts can be sent to an external service
  • 36. The Anatomy of an Alert: An alert starts with a query, like up
  • 37. The Anatomy of an Alert: This is more info than we want though
  • 38. The Anatomy of an Alert: What we really want is to count how many we have
  • 39. The Anatomy of an Alert: Or change how we count them
  • 40. The Anatomy of an Alert: And do some math
  • 41. The Anatomy of an Alert: Check out a quick chart
  • 42. The Anatomy of an Alert: This is more fun
  • 43. The Anatomy of an Alert: And go back and turn our earlier query into an alert
      ALERT NodeDown
        IF up{job="node"} == 0
        FOR 1m
        LABELS {prdcode="0000", host="Shared_Infra", severity="critical", support="Prometheus_Critical"}
        ANNOTATIONS {
          description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 1 minute.",
          rosguide="Please see Application ROS guide",
          summary="Instance {{$labels.instance}} down"
        }
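Two parts of this rule are worth tracing by hand: the `FOR 1m` clause, which keeps the alert pending until the condition has held continuously for a minute, and the `{{$labels.instance}}` placeholders, which are filled in from the labels of the series that triggered the alert. The Python sketch below mimics both under simplifying assumptions. The names `alert_states` and `expand`, the 15 second evaluation interval, and the instance value are hypothetical, and `expand` handles only the `{{$labels.name}}` form rather than the full template language Prometheus uses.

```python
import re


def alert_states(evaluations, for_seconds):
    """Return (timestamp, state) for each rule evaluation.

    evaluations -- list of (timestamp, condition_is_true) in time order
    """
    active_since = None
    states = []
    for timestamp, condition in evaluations:
        if not condition:
            # Any evaluation where the condition is false resets the clock.
            active_since = None
            states.append((timestamp, "inactive"))
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append((timestamp, "firing" if held_for >= for_seconds else "pending"))
    return states


def expand(template, labels):
    """Replace {{$labels.name}} placeholders with the matching label value."""
    return re.sub(
        r"\{\{\s*\$labels\.(\w+)\s*\}\}",
        lambda match: labels.get(match.group(1), ""),
        template,
    )


# The node stops answering scrapes at t=15, so up == 0 is true from then on.
evaluations = [(0, False), (15, True), (30, True), (45, True), (60, True), (75, True)]
for timestamp, state in alert_states(evaluations, 60):
    print(timestamp, state)

print(expand("Instance {{$labels.instance}} down", {"instance": "node1:9100", "job": "node"}))
```

With these inputs the alert only reaches the firing state at t=75, a full 60 seconds after the condition first became true, which is the delay that protects against paging on a single missed scrape.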
  • 44. A NodeDown Alert Sent To Chat
      OptimusPrime (bot) 3:32 PM
      AlertManager message: [FIRING:1] NodeDown (0000 prod node.metrics Shared_Infra node app critical Prometheus_Critical). Learn more at https://somewhere.dockeralerts.company.com:8443/#/alerts?receiver=ChatBot
      "Fate rarely calls on us at a moment of our choosing" - Optimus Prime
  • 45. How are Rules/Alerts Categorized?
      Rules/Alerts are segregated into functionally specific rule files:
      • alert.rules
          • Basic alert file installed with 1 rule, 'IF up{job="node"} == 0'
      • alert.infra.logging.rules
          • Logging ruleset
      • alert.infra.monitoring.rules
          • Monitoring stack rules
      • alert.infra.rules
          • Basic infrastructure rules such as file systems, memory, and thinpool
      • alert.service.app.prod.rules
          • Service level rules such as redis, mongodb, rabbitmq, etc.
      • alert.docker.rules
          • Rules for Docker itself
      • alert.0000.app.rules
          • Application specific rules
  • 47. What is Grafana?
      • Grafana is a leading Open Source Data Visualization Tool
      • Create and share intuitive dashboards
      • Rich graphing and charting
      • Mixed styling within a dashboard
      • Dashboard templates
      • Lots of additional features
  • 50. What's in Your Stack?
      The Infrastructure Monitoring Stack is currently considered v1.0:
      • Prometheus v1.3.1
      • Grafana v3.1.1
      • Alertmanager v0.4.2 custom-v2
      • HAProxy v1.6.9
      • cAdvisor v0.24.1
      • Node-exporter v0.12.0
      • Blackbox-exporter v0.2.0
  • 51. Tech Stack SOA
      We use Git to manage configurations and changes to the tech stack. Git is a distributed version control system.
      • Simple to use
      • Enables code collaboration
      • Eases deployments
      • https://somewhere.company.com/git/projects/PRJ0000/repos/infra-prom-stack/browse
  • 52. Basic Stack Deployment
      The Monitoring Stack is deployed and configured from 1 location in each Docker Swarm, typically the first Docker Master Node.
      • Configuration files
          • /company/compose/infra-prom-stack
      • Prometheus Configuration
          • /company/compose/infra-prom-stack/infra/prometheus/config/prometheus.yml
      • Alertmanager Configuration
          • /company/compose/infra-prom-stack/infra/prometheus/config/alertmanager.conf
      • Alert Files
          • /company/compose/infra-prom-stack/infra/prometheus/alerts
  • 53. Controlling the Stack
      The Makefile simplifies stack management by reducing error-prone commands to simple make targets. It is used both to configure and install the Monitoring Stack and to manage the stack during runtime. Some examples:
      • make pushconfigs-all
          • Distributes configuration to all Swarm nodes
      • make hup-prometheus
          • Gently restarts Prometheus Server after a configuration change
      • make start
          • Equivalent to a `docker compose up` with cluster specific information
      • make start-all
          • Starts the stack and scales all required services
  • 54. Fixing Some Basic Problems
      These commands are run from /company/compose/infra-prom-stack on the first Master Node.
      • One or more cAdvisor containers are down
          • Restart via UCP
          • If that fails, remove the stopped containers
          • Run `make scale-cadvisor` from /company/compose/infra-prom-stack
      • One or more node-exporter containers are down
          • Restart via UCP
          • If that fails, remove the stopped containers
          • Run `make scale-node-exporter` from /company/compose/infra-prom-stack
      • Cannot connect to Prometheus Server, Grafana, or Alertmanager
          • Validate they are up via UCP
          • Occasionally HAProxy seems to get confused and needs a simple restart via UCP
  • 57. Network Overlay and Shared Services
      Infrastructure Monitoring and Logging services are currently deployed as shared infrastructure services in a Docker Overlay network.
      • Overlay name: infra_netmon
          • Monitoring stack
          • Logging stack
  • 58. Federation: Who monitors the monitors?
      Prometheus is Federated, enabling existing Prometheus Servers to monitor other Prometheus Servers.
      • north-nonprod monitors both
          • east
          • west
      • east monitors
          • west
      • west monitors
          • east
      • Basic synthetic monitoring
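Mechanically, federation is just one Prometheus Server scraping another through its `/federate` endpoint, with one or more `match[]` parameters selecting which series to pull. The Python sketch below builds such a scrape URL. The host name `prometheus-east.example.com:9090` and the helper `federate_url` are hypothetical, and the selectors are examples rather than the ones used in the original deployment.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base_url, selectors):
    """Build a /federate URL that pulls every series matching any of the selectors."""
    # match[] is repeated once per selector; urlencode handles the escaping.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


url = federate_url(
    "http://prometheus-east.example.com:9090",
    ['{job="node"}', "up"],
)
print(url)

# Decoding the query string shows the selectors survive the round trip.
print(parse_qs(urlsplit(url).query)["match[]"])
```

In a real setup the same path and parameters would normally be expressed in the scraping server's own scrape configuration rather than built by hand; the URL form is shown here only because it makes the mechanism visible.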
  • 59. Future Work
      If we stick with Prometheus, there are several improvements that will need exploration and engineering:
      • Integrate configuration and deployment via a CI/CD pipeline
      • Improve and refine Rules/Alerts
      • Update Prometheus Server to latest version
          • Not much to gain here at the moment
      • Update Grafana to latest version
          • Some interesting new features including built in alerts
      • Back Grafana with a relational database
          • Enables persistent annotations
      • Engineer HA Prometheus and Alertmanager within a cluster
      • Figure out a better persistent storage strategy
          • This is bigger than Prometheus/Monitoring
  • 60. Want to Learn More?
      Since this is an Open Source solution we will have new tradeoffs vs. a fully vendored solution. The following resources are suggested for those wanting to dive deeper into this technology stack.
      • See the Prometheus docs, GitHub repo, YouTube videos, and Robust Perception blog
          • https://prometheus.io/docs/introduction/overview/
          • https://github.com/prometheus/prometheus
          • https://www.youtube.com/watch?v=gNmWzkGViAY&t
          • https://www.robustperception.io/blog/
      • See the Grafana docs, GitHub repo, and Screencasts
          • http://docs.grafana.org/
          • https://github.com/grafana/grafana
          • https://www.youtube.com/playlist?list=PLDGkOdUX1Ujo3wHw9-z5Vo12YLqXRjzg2
      • See the cAdvisor GitHub repo
          • https://github.com/google/cadvisor
  • 61. Key Points
      • Microservices are (intended to be) ephemeral
      • We need to monitor potentially transient services and act accordingly
      • This is an Open Source solution down the stack
      • Prometheus is targeted to replace existing on-prem roles
          • Capable of very basic synthetics
          • Can set up service level monitoring for mongodb, rabbitmq, etc(d).
      • Interface with 3rd party connectors
      • Alerts are easy to create and manage
      • Deployed as Infrastructure as Code
          • Embrace DevOps
  • 63. I Hope This Isn't You Right Now