Monitoring as an entry point for collaboration

Julien Pivotto (@roidelapluie)
DevOpsDays Geneva
February 22nd, 2019

@roidelapluie
I like Open Source
I like monitoring
I like automation
... and all of that is my daily job at inuits

This talk is based on experience. Therefore we will
talk about the Prometheus ecosystem, but it applies
to other workflows and tools.

The DevOps principles:
CAMS
(a definition of DevOps)
Culture
Automation
Measurement
Sharing
(Damon Edwards and John Willis, 2010 http://devopsdictionary.com/wiki/CAMS)
This talk is about all of it.

Who is behind the magic
Dev Ops Security Virtualization QA Networking
Sales Customers Partners ...

Monitoring
Creative Commons Attribution 2.0 https://www.flickr.com/photos/24375810@N06/3719090065

Creative Commons Attribution ShareAlike 2.0 https://www.flickr.com/photos/grendelkhan/400428874

Creative Commons Public Domain https://pxhere.com/en/photo/265717

Traditional Monitoring
It works - OK
It does not work - CRITICAL
It kinda works - WARNING
I don't know - UNKNOWN

Creative Commons Public Domain https://pxhere.com/fr/photo/952999

Creative Commons Attribution 2.0 https://www.flickr.com/photos/wwarby/2460655511

Creative Commons Attribution-Share Alike 3.0 Unported
https://commons.wikimedia.org/wiki/File:CUPE3903-picketLine-20180504.jpg

Real world
It works ; it does not work ; it kinda works ; it maybe
works ; no one uses it ; it is broken ; some things
are broken ; it should work but it does not ; where
are my users? help me...

The Technical bias
By looking only at the technical services, we miss
the actual point
Are we serving our users correctly?
Just looking at the traffic light will not tell you about
the traffic jams.

Further questions
At which speed are the cars running?
How long do they stop?
How many pedestrians are crossing the road?

Observability

Observability is the ability to be inside the
application, and look around to observe its world.
In practice:
Collecting relevant information
Making it available quickly and easily

Metrics
Creative Commons Attribution-Share Alike 2.0 https://www.flickr.com/photos/tillwe/11892564676/

Metric
Name
Labels (Key-Value Pairs)
Value (Number)
Timestamp
Fetched at a high frequency

Name: Number of HTTP requests
Labels:
status: 200
vhost: inuits.eu
method: post
Value: 1823
Timestamp: Thu Oct 18 10:18:06 CEST 2018

Name: Number of HTTP requests
Labels:
status: 200
vhost: inuits.eu
method: post
Value: 2123
Timestamp: Thu Oct 18 10:18:36 CEST 2018

300 requests in 30 s = 10 requests per second
(POST for inuits.eu with response code 200)
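The arithmetic above can be sketched in a few lines of Python. This is a simplified illustration, not how Prometheus implements rate() (which also extrapolates to the window boundaries); the function name per_second_rate is hypothetical.

```python
from datetime import datetime


def per_second_rate(v1: float, t1: datetime, v2: float, t2: datetime) -> float:
    """Average per-second increase of a counter between two samples."""
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0:
        raise ValueError("the second sample must be newer than the first")
    return (v2 - v1) / elapsed


# The two samples from the slides, taken 30 seconds apart.
first = datetime(2018, 10, 18, 10, 18, 6)
second = datetime(2018, 10, 18, 10, 18, 36)
rate = per_second_rate(1823, first, 2123, second)
print(rate)
```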

http_request_total{job="fe",instance="fe1",code="200"}

Types of metrics
Counters
Gauges
Histograms
Summaries

Counters
Always go up
start from zero
rate, increase
e.g. number of http requests
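Because a counter restarts from zero when the process restarts, functions like rate and increase treat any decrease as a reset. A minimal sketch of that idea, assuming an ordered list of raw counter values (the helper name increase_with_resets is hypothetical):

```python
def increase_with_resets(samples):
    """Total increase over ordered counter values, tolerating resets to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went down: assume a restart and count from zero.
            total += current
    return total


# 100 -> 150 (+50), restart, 0 -> 20 (+20), 20 -> 50 (+30)
print(increase_with_resets([100, 150, 20, 50]))
```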

Gauges
Go up and down
Average, Sum, Max, ...
(also over time)
e.g. concurrent users

Histograms and summaries
Sets of requests
Using "buckets"
Useful to get duration, percentiles, SLA
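To show how buckets lead to percentiles, here is a rough sketch of the linear interpolation idea behind histogram_quantile. It assumes cumulative bucket counts (as Prometheus histograms expose them), a lower bound of 0 for the first bucket, and a final +Inf bucket; the function name and the sample numbers are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with an infinite bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No upper limit to interpolate towards: fall back to the
                # highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request durations in seconds: 50 under 0.1 s, 90 under 0.5 s, ...
durations = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, durations)
print(round(p95, 3))
```

The result is only as precise as the bucket boundaries, which is why bucket layout should match the SLA you care about.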

Metrics and monitoring
Metrics do not represent problems
Metrics represent a state, give insights
Metrics can be graphed
You can alert based on them

Exposed metrics are "raw"
In general you can just expose counters, and let the
monitoring server do the real maths.
That keeps the overhead on the apps very low.

Tooling
Creative Commons Attribution 2.0 https://www.flickr.com/photos/psd/5298483

What are the needs?
Ingest metrics at high frequency
React to changes
Empower people
Alert on metrics

Use one toolchain
Creative Commons Attribution-ShareAlike 2.0
https://www.flickr.com/photos/161054138@N08/37880775085

Stop with:
Having one "monitoring" stack plus one "graphing" stack
Big all-in-one tools: think decentralized, scalable
Auto Discovery (use service discovery instead)
Manual config
Fragile monitoring (think HA)

Prometheus
https://prometheus.io/

Prometheus
Open Source monitoring tool
Complete Ecosystem
For cloud and on prem
Built around metrics

Cloud Native
Easy to configure, deploy, maintain
Designed as multiple services
Container ready
Orchestration ready (dynamic config)
Fuzziness

Efficient
"Scrapes" millions of metrics
Scales
Manages its own optimized db
(prometheus/tsdb)

Pull vs Push
Prometheus pulls metrics
But does not know what it will get!
The target decides what to expose
(short-lived batch jobs can still push to a
"Pushgateway")

Exporters
Expose metrics with an HTTP API
Bindings available for many languages (for
"native" metrics)
Exporters do not save data; they are not
"proxies" and don't "cache" anything

Common exporters
Node Exporter: Linux System Metrics
Grok Exporter: Metrics from log files
SNMP Exporter: Network devices
Blackbox Exporter: TCP, DNS, HTTP requests
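What an exporter serves over HTTP is plain text, one sample per line. The sketch below renders such a line by hand to show the shape of the format; real applications would normally use an official client library instead. The helper names are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


line = render_sample(
    "http_request_total",
    {"job": "fe", "instance": "fe1", "code": "200"},
    1823,
)
print(line)
```

Labels are sorted here only to make the output stable; the order carries no meaning.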

Grafana
Open Source (Apache 2.0)
Web app
Specialized in visualization
Pluggable
Multiple datasources: prometheus, graphite,
influxdb...
Has an API!

Grafana and Prometheus
Prometheus used to ship its own consoles
It now recommends Grafana and has deprecated
its own consoles

Business Metrics

What are business metrics?
Metrics that effectively tell you how you fulfil
your customers' requests
Provide quality and level of service to
customers

CPU usage is no money
https://www.flickr.com/photos/nox_noctis_silentium/3960497840

Where to get them?
Frontends
Databases
Caching systems (sessions, ...)
...
Each one of them requires a cross-team
understanding of the business.

Where to start?
Creative Commons Attribution 2.0 https://www.flickr.com/photos/franckmichel/16265376747/

USE
Brendan Gregg's USE method
U = Utilisation S = Saturation E = Errors
For resources like network, CPU, memory,...
Also asynchronous processes, ...

RED
Tom Wilkie's RED method
R = Requests E = Errors D = Duration
HTTP requests, synchronous processes, ...

What to get?
Request Rate
Saturation
Error Rate
Duration

Before we dig in...
What we will see now is monitoring data. It should
not be used where exact figures are required, such as invoicing.

Caching System Monitoring
(USE)

What do we learn?
Users can connect to the platform: The
authentication works
The platform is currently used

Benefits
Connected users = they can use the platform
Know when you can do maintenance
Know about your users' general habits (trends)

Database
Using SQL exporters to query the data from
your database
Requires a cross-team approach
Gets you fine grained, quality data

Database trap
Do not try to replace BI/Reporting
Do not take too many labels -- stay in the
monitoring area

sum by (instance, env) (
  rate(http_requests_duration_count[5m])
)

sum by (code, env) (
  rate(http_requests_duration_count{code!="200"}[5m])
) / ignoring (code) group_left
sum by (env) (
  rate(http_requests_duration_count[5m])
)

sum by (env) (
  rate(http_requests_duration_sum[5m])
) /
sum by (env) (
  rate(http_requests_duration_count[5m])
)
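The last two queries are ratios: the share of non-200 responses in all requests, and the average duration (sum of durations divided by number of requests). A small Python sketch of the same arithmetic, with made-up per-second rates and hypothetical function names:

```python
def error_ratio(rate_by_code):
    """Share of total traffic for each non-200 response code."""
    total = sum(rate_by_code.values())
    if total == 0:
        return {}
    return {
        code: rate / total
        for code, rate in rate_by_code.items()
        if code != "200"
    }


def average_duration(duration_sum_rate, request_count_rate):
    """Average seconds per request over the window."""
    if request_count_rate == 0:
        return float("nan")
    return duration_sum_rate / request_count_rate


# Hypothetical rates for one environment, in requests per second.
ratios = error_ratio({"200": 90.0, "404": 8.0, "500": 2.0})
avg = average_duration(12.5, 50.0)
print(ratios, avg)
```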

What can we learn?
We have traffic from outside
How much traffic
Quality of the traffic
How long it really takes (end to end)

Networking
Utilisation
Saturation
Errors
Multicast
Broadcast
Use aliases to identify ports - human readable

Adding Time
Creative Commons Attribution-ShareAlike 2.0 https://www.flickr.com/photos/rswilson74/3375654385

Timeseries
How we use time: we take the metrics for the
last 7 weeks
We take the median value (excluding the 3 highest
and the 3 lowest)
This excludes anomalies due to incidents, holidays, ...

http_requests:rate5m offset 1w
offset queries data in the past

- record: past_request_rate
  expr: http_requests:rate5m offset 1w
  labels:
    when: 1w

- record: past_request_rate
  expr: http_requests:rate5m offset 2w
  labels:
    when: 2w

max without(when) (
  bottomk(1,
    topk(4,
      past_request_rate
    )
  )
)
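Why this query yields the median: with 7 weekly values, the smallest of the 4 largest is the 4th value in sorted order, which is the middle one. A quick Python check of that reasoning, with made-up weekly rates (one incident week and one holiday week):

```python
import statistics


def median_via_topk(values, k=4):
    """Mimic bottomk(1, topk(k, ...)): smallest of the k largest values."""
    top_k = sorted(values, reverse=True)[:k]
    return min(top_k)


# Hypothetical request rates at the same moment in each of the last 7 weeks.
weekly_rates = [120, 95, 100, 300, 98, 102, 5]
baseline = median_via_topk(weekly_rates)
print(baseline, statistics.median(weekly_rates))
```

The outliers (300 and 5) do not move the result, which is the point of using a median rather than an average.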

What do we learn?
Predict users' habits
Deviations from the norm are not normal
They may mean that users cannot reach us or use
our services
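One possible way to turn "deviation from the norm" into a check is to compare the current rate against the historical baseline with a relative tolerance. This is only a sketch: the 30% tolerance and the function name are assumptions, not values from the talk.

```python
def deviates(current, baseline, tolerance=0.3):
    """True when current differs from baseline by more than the tolerance."""
    if baseline == 0:
        # Without a baseline, any traffic at all is a deviation.
        return current != 0
    return abs(current - baseline) / baseline > tolerance


print(deviates(60, 100))  # 40% below the baseline
print(deviates(90, 100))  # 10% below the baseline
```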

Why do business metrics
matter?
Good service depends on: linux health, dns,
network, ntp, disk space, cpu, open files, database,
cache systems, load balancers, partners, electricity,
virtualization stack, nfs, ... and it changes over time
Customers won't call you because your disk is full!

Partners
Creative Commons Attribution 2.0 https://www.flickr.com/photos/deanhochman/27248626739

Given that the End User matters
We have decided to standardize metrics
exchange between partners
Prometheus format used (soon to be
OpenMetrics)
Everyone knows HTTP!

What do we exchange?
We are not interested in our partners' internals
(and do not want to expose ours)
We are exchanging precomputed metrics (rate
over 5 minutes, duration over 5 minutes),
excluding servers, instances, ...
Identify, in the chain, the bottlenecks and the
issues
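"Excluding servers, instances" amounts to summing series while dropping the labels that describe internals, similar to a sum without (...) in PromQL. A minimal Python sketch, with hypothetical label names and values:

```python
from collections import defaultdict


def aggregate_without(samples, drop):
    """Sum sample values after removing the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[kept] += value
    return dict(totals)


# Hypothetical per-instance request rates.
internal = [
    ({"env": "prod", "instance": "fe1"}, 4.0),
    ({"env": "prod", "instance": "fe2"}, 6.0),
    ({"env": "test", "instance": "fe1"}, 1.0),
]
shared = aggregate_without(internal, drop={"instance"})
print(shared)
```

What leaves the organisation is one value per environment, with no hint of how many servers sit behind it.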

Kinds of dashboards
General (multiple business)
Business overview (e.g. one app)
Business focused (e.g. one process)
Technical overview (e.g. linux cluster)
Technical focus (e.g. linux host)
Even more focused (e.g. cpu usage)

Dashboards
We define our business dashboards in two parts:
10 graphs on top about the business: RED,
USE, Alerts, data from partners, monitoring
robots, state of the monitoring
hidden by default: Technical Health - ntp, disk,
db, network, jvm, ...

Limited number of graphs
Errors in red
Attention points in Yellow/Orange

Technical view: more graphs, empty when OK

Dashboards
Duplicate some dashboards to compare against a
historical view, especially when a dashboard shows
business-specific patterns that are not easy to remember.

Dashboards
Easy drill-down between dashboards, with
predefined variables

Dashboard
Provide relevant help where needed
(from the haproxy documentation)

Dashboards
On product launch / change / ... extract
relevant data from the service and build a
"temporary dashboard"
Share with the teams and managers, show on
big screen

Conventions
Color conventions, general:
RED = Bad
Yellow = Attention
Blue/Green = OK
Also:
RED = problem at our side
Yellow/orange = problem at partners side

HTTP Codes
2xx: Greens
3xx: Yellows
4xx: Blues (404: grey)
5xx: Orange/Red
Same across all dashboards to enable quick/easy
reading.
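A convention like this is easiest to keep consistent when it lives in one place. A small sketch of the mapping as a function; the name is hypothetical, and raising an error for other status classes is a choice made here, since the slides only cover 2xx to 5xx.

```python
def color_for_status(code: int) -> str:
    """Return the dashboard color family for an HTTP status code."""
    if code == 404:
        return "grey"
    families = {2: "green", 3: "yellow", 4: "blue", 5: "orange/red"}
    family = code // 100
    if family not in families:
        raise ValueError(f"no color convention for status {code}")
    return families[family]


for status in (200, 301, 403, 404, 503):
    print(status, color_for_status(status))
```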

This is not only about other teams
Newcomers
People passing by or not actively looking
On-Call
During incidents... lots of people
For those reasons, keep your dashboards simple
and intuitive!

Side note: github.com/grafana/grafonnet-lib/
A Jsonnet Library to write grafana dashboards.

Alerting
Creative Commons Attribution 2.0
https://www.flickr.com/photos/calliope/234447967

How to do alerting right
Use multiple channels (chat, tickets)
Alert only when really needed (non-prod: business hours only)
Send the alert to the right people (incl.
partners)
Make the alerts actionable

Crisis
Major incident in production
Affecting multiple projects
"Situation room": 2 channels: 1 for all the
alerts, 1 for the people
Bring managers, and all the relevant tech
people in the same room
Unique channel of communication for the
incident (archived after the incident)

Conclusion
https://www.flickr.com/photos/willy_photoshop/34829332342/

Quick Answers
Business monitoring allows you to know early
when things are wrong, across teams
Provides clear answers to your customers in
minutes (no more "I don't know, I will check")
Draw parallels between technical and business
metrics (to find causes)

What happened?
Is it REALLY fixed?
When?
Until when (technical and business)?
What did I miss? What is the impact?

Metrics benefits
Because you run queries and alerts from a
central location
You can run queries across targets/jobs
Detect faulty instances, alert for server X
based on metrics of server Y

Metrics benefits
Trends
Dynamic thresholds
Predictions

Do not underestimate the monitoring of the
development / staging environments.

Business metrics are good
candidates to wake up someone at
night.
The downside is that that person must be fluent
with the business.

Prometheus benefits
Pull-based, metrics-centric
The targets (e.g. developers) choose the
metrics they expose => Empowering people
HTTP permits TLS, Client Auth, ... and cross
org sharing of metrics
Becoming a standard in the industry

Grafana
Central point for all teams
Show current and past status
Should give you the opportunity to answer
questions

Focusing on business metrics is hard work, but it
shows benefits across teams and provides visibility
to management, enabling you to gain trust and
move more quickly towards a DevOps model.

Julien Pivotto
roidelapluie
roidelapluie@inuits.eu
Inuits
https://inuits.eu
info@inuits.eu
Contact
