This document summarizes a talk on using monitoring as an entry point for collaboration. It discusses using the Prometheus monitoring system to collect metrics and expose them using exporters. Grafana is then used to visualize the metrics and create dashboards focused on business metrics like requests, errors, and durations. These metrics provide observability across teams and enable alerting when business services are impacted.
4. This talk is based on experience. Therefore we will
talk about the Prometheus ecosystem, but it applies
to other workflows and tools.
5. The DevOps principles:
CAMS
(a definition of DevOps)
Culture
Automation
Measurement
Sharing
(Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS)
This talk is about all of it.
6. Who is behind the magic
Dev Ops Security Virtualization QA Networking
Sales Customers Partners ...
16. Real world
It works ; it does not work ; it kinda works ; it maybe
works ; no one uses it ; it is broken ; some things
are broken ; it should work but it does not ; where
are my users? help me...
17. The Technical bias
By looking only at technical services, we miss the
actual point
Are we serving our users correctly?
Just looking at the traffic light will not tell you about
the traffic jams.
18. Further questions
How fast are the cars going?
How long do they stop?
How many pedestrians are crossing the road?
20. Observability is the ability to be inside the
application, and look around to observe its world.
In practice:
Collecting relevant information
Making it available quickly and easily
31. Metrics and monitoring
Metrics do not represent problems
Metrics represent a state, give insights
Metrics can be graphed
You can alert based on them
32. Exposed metrics are "raw"
In general you can just expose counters, and let the
monitoring server do the real maths.
That keeps the overhead on the apps very low.
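As a rough illustration of the "real maths" left to the monitoring server, the sketch below derives a per-second rate from two raw counter samples. The function name `per_second_rate` and the sample values are made up for this example, and the reset handling is a simplification; it is not how Prometheus implements `rate()`.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Return the average per-second increase between two counter samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_value < prev_value:
        # Counter went backwards: assume the app restarted and counted from zero.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase / elapsed


# The app only exposes the raw counter; the server scraped it twice, 60 s apart.
rate = per_second_rate(prev_value=1200, prev_ts=0, curr_value=1500, curr_ts=60)
print(rate)
```

The app never computes a rate itself: it only increments a number, and the server decides which time window to use.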
34. What are the needs?
Ingest metrics at high frequency
React to changes
Empower people
Alert on metrics
35. Use one toolchain
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/161054138@N08/37880775085
36. Stop with:
Having 1 "monitoring" + 1 "graphing" stacks
Big all-in-one tools: think decentralized, think scale
Auto Discovery (use service discovery instead)
Manual config
Fragile monitoring (think HA)
46. Pull vs Push
Prometheus pulls metrics
But does not know what it will get!
The target decides what to expose
(short-lived batch jobs can still push to a
"Pushgateway")
47. Exporters
Expose metrics with an HTTP API
Bindings available for many languages (for
"native" metrics)
Exporters do not save data ; they are not
"proxies" and don't "cache" anything
48. Common exporters
Node Exporter: Linux System Metrics
Grok Exporter: Metrics from log files
SNMP Exporter: Network devices
Blackbox Exporter: TCP, DNS, HTTP requests
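To make the exporter idea concrete, here is a minimal sketch using only the Python standard library: it serves counters over HTTP in the Prometheus text format and renders them at scrape time, without storing history. The metric names, the `COUNTERS` dictionary and the port are assumptions for this example; real applications would normally rely on an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters; a real exporter would read them
# from the system it observes at scrape time.
COUNTERS = {"demo_requests_total": 0, "demo_errors_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Values are rendered on every scrape: nothing is cached or saved.
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the monitoring server would then be configured to pull `/metrics` from it.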
50. Grafana
Open Source (Apache 2.0)
Web app
Specialized in visualization
Pluggable
Multiple datasources: Prometheus, Graphite,
InfluxDB...
Has an API!
53. What are business metrics?
Metrics that effectively tell you how you fulfil
your customers' requests
Provide quality and level of service to
customers
54. CPU usage is not money
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/nox_noctis_silentium/3960497840
55. Where to get them?
Frontends
Databases
Caching systems (sessions, ...)
...
Each one of them requires a cross-team
understanding of the business.
56. Where to start?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/
57. USE
Brendan Gregg's USE method
U = Utilisation S = Saturation E = Errors
For resources like network, CPU, memory,...
Also asynchronous processes, ...
58. RED
Tom Wilkie's RED method
R = Requests E = Errors D = Duration
HTTP requests, synchronous processes,...
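A minimal sketch of what RED instrumentation could look like inside an application: three raw counters, updated around each request. The class `RedRecorder` and its attribute names are hypothetical; a real service would expose these values through a Prometheus client library.

```python
import time


class RedRecorder:
    """Keep the three RED counters for one service (hypothetical helper)."""

    def __init__(self):
        self.requests = 0            # R: total requests handled
        self.errors = 0              # E: requests that raised an exception
        self.duration_seconds = 0.0  # D: cumulated time spent handling

    def observe(self, handler, *args, **kwargs):
        """Call handler, updating the counters whether it succeeds or fails."""
        start = time.perf_counter()
        self.requests += 1
        try:
            return handler(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.duration_seconds += time.perf_counter() - start
```

Wrapping a call as `red.observe(handle_request, request)` is enough to feed all three counters; rates and averages are left to the monitoring server.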
72. sum by (code, env) (
rate(http_requests_duration_count{code!="200"}[5m])
) / ignoring (code) group_left
sum by (env) (
rate(http_requests_duration_count[5m])
)
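The query above gives, for each non-200 code and environment, its share of all requests in that environment. The same arithmetic is sketched below in plain Python on made-up per-instance rates; the function name and the sample data are assumptions, and skipping environments without traffic is a simplification of this sketch.

```python
from collections import defaultdict


def non_200_share(rates):
    """rates maps (code, env, instance) to a requests-per-second rate.

    Returns {(code, env): share of that env's traffic} for codes other than "200".
    """
    per_code_env = defaultdict(float)  # numerator: sum by (code, env)
    per_env = defaultdict(float)       # denominator: sum by (env)
    for (code, env, _instance), rate in rates.items():
        per_env[env] += rate
        if code != "200":
            per_code_env[(code, env)] += rate
    return {
        (code, env): rate / per_env[env]
        for (code, env), rate in per_code_env.items()
        if per_env[env] > 0  # simplification: skip envs without traffic
    }


sample = {
    ("200", "prod", "a"): 90.0,
    ("200", "prod", "b"): 100.0,
    ("500", "prod", "a"): 6.0,
    ("500", "prod", "b"): 4.0,
}
print(non_200_share(sample))
```

Here the 500s add up to 10 requests per second out of 200 in "prod", so their share is 5%.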
78. Timeseries
How we use time: We take the metrics for the
last 7 weeks
We take the median value (exclude 3 top and 3
low)
Excludes anomalies due to incidents/holidays...
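With seven weekly values, dropping the 3 highest and the 3 lowest leaves exactly the median. A short sketch with made-up numbers (the function name `weekly_baseline` is hypothetical):

```python
from statistics import median


def weekly_baseline(samples):
    """Return the median of the same time slot over the last 7 weeks."""
    if len(samples) != 7:
        raise ValueError("expected one sample per week for 7 weeks")
    return median(samples)


# Requests per second at the same weekday and time, oldest first (made-up numbers).
# 40 could be an incident and 300 a sales peak: both are ignored by the median.
last_seven_weeks = [120, 118, 40, 125, 122, 300, 119]
baseline = weekly_baseline(last_seven_weeks)
print(baseline)
```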
85. What do we learn?
Predict users' habits
Deviations from the norm are not normal
They mean that users cannot reach us/use our
services
86. Why business metrics
matter?
Good service depends on: linux health, dns,
network, ntp, disk space, cpu, open files, database,
cache systems, load balancers, partners, electricity,
virtualization stack, nfs, ... and it moves over time
Customers won't call you because your disk is full!
88. Given that the End User matters
We have decided to standardize metrics
exchange between partners
Prometheus format used (soon to be
OpenMetrics)
Everyone knows HTTP!
89. What do we exchange?
We are not interested in our partners' internals (and
do not want to expose ours)
We are exchanging precomputed metrics (rate
over 5 minutes, duration over 5 minutes),
excluding servers, instances, ...
Identify, in the chain, the bottlenecks and the
issues
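One way to picture "excluding servers, instances" is to sum the precomputed values over the internal labels before sharing them. The sketch below assumes the private labels are called `instance` and `server`; the label names, the helper name and the data layout are all hypothetical.

```python
from collections import defaultdict

# Assumed names of the labels we keep private.
INTERNAL_LABELS = {"instance", "server"}


def to_partner_view(series):
    """series is a list of (labels_dict, value) pairs.

    Returns {sorted_label_tuple: summed value} without the internal labels.
    """
    shared = defaultdict(float)
    for labels, value in series:
        public = tuple(
            sorted((k, v) for k, v in labels.items() if k not in INTERNAL_LABELS)
        )
        shared[public] += value
    return dict(shared)
```

The partner then sees one value per environment and code, with no hint about how many machines are behind it.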
91. Kind of dashboards
General (multiple business)
Business overview (e.g. one app)
Business focused (e.g. one process)
Technical overview (e.g. linux cluster)
Technical focus (e.g. linux host)
Even more focused (e.g. cpu usage)
92. Dashboards
We define our business dashboards in two parts:
10 graphs on top about the business: RED,
USE, Alerts, data from partners, monitoring
robots, state of the monitoring
hidden by default: Technical Health - ntp, disk,
db, network, jvm, ...
93. Limited number of graphs
Errors in RED
Attention points in Yellow/Orange
95. Dashboards
Duplicate some dashboards to compare against a
historical view, especially when a dashboard shows
business patterns that are not easy to remember.
100. Dashboards
On product launch / change / ... extract
relevant data from the service and build a
"temporary dashboard"
Share with the teams and managers, show on
big screen
102. HTTP Codes
2xx: Greens
3xx: Yellows
4xx: Blues (404: grey)
5xx: Orange/Red
Same across all dashboards to enable quick/easy
reading.
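Keeping the convention in one shared helper is one possible way to make sure every dashboard uses the same colours. The function below is only a sketch of the mapping listed above; its name and the `None` fallback for other codes are assumptions.

```python
def status_color(code):
    """Map an HTTP status code to the colour family used on all dashboards."""
    if code == 404:
        return "grey"  # 404 is singled out from the other 4xx codes
    families = {2: "greens", 3: "yellows", 4: "blues", 5: "orange/red"}
    # Codes outside 2xx-5xx have no agreed colour in this sketch.
    return families.get(code // 100)
```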
103. This is not only cross teams
Newcomers
People passing by or not actively looking
On-Call
During incidents .. lots of people
For those reasons, keep your dashboards simple
and intuitive!
106. How to do alerting right
Use multiple channels (chat, tickets)
Alert only when really needed (non-prod: business hours only)
Send the alert to the right people (incl.
partners)
Make the alerts actionable
107. Crisis
Major incident in production
Affecting multiple projects
"Situation room": 2 channels: 1 for all the
alerts, 1 for the people
Bring managers, and all the relevant tech
people in the same room
Unique channel of communication for the
incident (archived after the incident)
110. Quick Answers
Business monitoring allows you to know early
when things are wrong, across teams
Provides clear answers to your customers in
minutes (no more "I don't know, I will check")
Parallels can be drawn between technical and
business metrics (to find causes)
111. What happened?
Is it REALLY fixed?
When?
Until when (technical and business)?
What did I miss? What is the impact?
112. Metrics benefits
Because you run queries and alerts from a
central location
You can run queries across targets/jobs
Detect faulty instances, alert for server X
based on metrics of server Y
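Because all metrics sit in one place, an instance can be judged against its peers. The sketch below flags instances whose error ratio is far above the median of the group; the function name, the `factor` and `floor` parameters and the sample values are assumptions chosen for illustration.

```python
from statistics import median


def faulty_instances(error_ratio_by_instance, factor=3.0, floor=0.01):
    """Return the instances whose error ratio is well above their peers.

    An instance is flagged when its ratio exceeds `factor` times the median
    of all instances, with `floor` as a minimum threshold.
    """
    if not error_ratio_by_instance:
        return []
    typical = median(error_ratio_by_instance.values())
    threshold = max(typical * factor, floor)
    return sorted(
        name for name, ratio in error_ratio_by_instance.items() if ratio > threshold
    )


# Made-up error ratios collected from three servers of the same job.
print(faulty_instances({"web1": 0.01, "web2": 0.012, "web3": 0.20}))
```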
115. Business metrics are good
candidates to wake up someone at
night.
The downside is that this person must be fluent
in the business.
116. Prometheus benefits
Pull-based, metrics-centric
The targets (e.g. developers) choose the
metrics they expose => Empowering people
HTTP permits TLS, client auth, ... and cross-org
sharing of metrics
Becoming a standard in the industry
117. Grafana
Central point for all teams
Show current and past status
Should give you the opportunity to answer
questions
118. Focusing on Business Metrics is hard work that will
show benefits across teams and provide visibility
towards hierarchy, enabling you to gain trust and
move on more quickly towards a DevOps model.