This document surveys approaches to monitoring systems, from manual and reactive practices to proactive monitoring built on container orchestration tools. It gives examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers, and it emphasizes the principles of observability: structured logging, events and traces enriched with metadata, and monitoring the monitoring systems themselves. The quoted speakers share best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
4. Manual
● User initiated
● Interactive, command-line tools, simple scripts
● Checklist and process driven
Reactive
● Hardware-centric data collection
● Simple metric and log collection
● Siloed tools and information
● Manual analysis and remediation
Proactive
● Application-centric data collection
● End-to-end observability
● Key metrics and thresholds well understood
● Semi-automated analysis and remediation
7. The ‘What’
Blackbox monitoring — that is, monitoring a system from the outside by treating it as a blackbox — is something I find very good at answering the what, and at alerting about a problem that's already occurring (and ideally end-user-impacting).
Cindy Sridharan
Engineer @ Apple
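A minimal sketch of such a blackbox probe, assuming a hypothetical /healthz endpoint and illustrative status/latency thresholds; it observes only what an end user could see from the outside:

import time
import urllib.request

def probe(url, timeout=5.0):
    # Measure the service purely from the outside, as a user would.
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"up": resp.status == 200, "status": resp.status,
                    "latency_s": time.monotonic() - start}
    except Exception as exc:
        return {"up": False, "error": str(exc),
                "latency_s": time.monotonic() - start}

result = probe("https://shop.example.com/healthz")   # placeholder URL
if not result["up"] or result["latency_s"] > 2.0:    # illustrative SLO
    print("ALERT: end-user-visible symptom:", result)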
12. The USE Model
For every resource, check Utilization, Saturation, and Errors.
Resource: all physical server functional components (CPUs, disks, busses, ...)
Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can't service, often queued
Errors: the count of error events
Brendan Gregg
Performance Engineer @ Netflix
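As a concrete illustration, here is a sketch of the USE checklist applied to one resource (the CPUs) on Linux, using only the Python standard library; the EDAC error-counter path is an assumption that is only populated on hardware with EDAC support:

import glob
import os
import time

def cpu_use_snapshot():
    # Utilization: fraction of time the CPUs were busy, from /proc/stat deltas.
    def busy_total():
        with open("/proc/stat") as f:
            vals = list(map(int, f.readline().split()[1:]))
        idle = vals[3] + vals[4]              # idle + iowait are not-busy time
        return sum(vals) - idle, sum(vals)
    b0, t0 = busy_total()
    time.sleep(1)
    b1, t1 = busy_total()
    utilization = (b1 - b0) / (t1 - t0)

    # Saturation: runnable work beyond what the CPUs can service (queued).
    load1, _, _ = os.getloadavg()
    saturation = max(0.0, load1 - os.cpu_count())

    # Errors: corrected hardware error counts, where EDAC exposes them.
    errors = sum(int(open(path).read())
                 for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"))
    return {"utilization": utilization, "saturation": saturation, "errors": errors}

print(cpu_use_snapshot())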
17. Evolving Workloads
As highly available cloud native infrastructure and
application workloads become more prevalent, more
care needs to be taken to get the monitoring systems
right, and to be sure that you are using dependable
metrics to dynamically manage your environments.
Adrian Cockcroft
VP Cloud Architecture @ AWS
20. The RED Model
Measure, for every microservice in your architecture:
(Request) Rate: the number of requests, per second, your services are serving.
(Request) Errors: the number of failed requests per second.
(Request) Duration: distributions of the amount of time each request takes.
Tom Wilkie
VP Product @ Grafana (Prev. @ Weaveworks)
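A minimal sketch of RED instrumentation with the Python prometheus_client library; the metric and handler names are illustrative, and the request rate is derived at query time from the request counter (e.g. rate(http_requests_total[1m]) in PromQL):

import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["handler"])
ERRORS = Counter("http_request_errors_total", "Failed requests", ["handler"])
DURATION = Histogram("http_request_duration_seconds", "Request latency", ["handler"])

def handle(handler="checkout"):
    REQUESTS.labels(handler).inc()
    with DURATION.labels(handler).time():           # records the latency distribution
        try:
            time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
            if random.random() < 0.02:              # simulated failure rate
                raise RuntimeError("downstream failure")
        except RuntimeError:
            ERRORS.labels(handler).inc()

start_http_server(8000)   # exposes /metrics for Prometheus to scrape
while True:
    handle()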
44. Metadata / Context
[Google has a] concept called tags. Tags are arbitrary key-value pairs we propagate all across the stack. Tags are propagated from top to very bottom, and each layer can add more to the context.
Tags often carry the originator library name, originator RPC name, etc. Once we retrieve instrumentation data from the low-end services, we can easily filter and point out which specific services, libraries or RPCs contributed to the state of things.
Jaana B. Dogan
Engineer @ Google
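The sketch below mimics this tag-propagation idea with Python's contextvars; it is not Google's implementation (OpenTelemetry "baggage" is the closest open-source analogue), and every name in it is illustrative:

import contextvars

TAGS = contextvars.ContextVar("tags", default={})

def with_tags(**extra):
    # Each layer merges its own tags in without disturbing the caller's set.
    return TAGS.set({**TAGS.get(), **extra})

def storage_layer():
    # The bottom of the stack still sees who originated the request.
    print("instrumentation tags:", TAGS.get())

def rpc_layer():
    token = with_tags(originator_rpc="GetUser")
    storage_layer()
    TAGS.reset(token)

token = with_tags(originator_library="frontend")
rpc_layer()   # prints both the originating library and the RPC name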
46. Chaos Engineering
"Chaos Engineering is the discipline of experimenting on
a distributed system in order to build confidence in the
system’s capability to withstand turbulent conditions in
production."
… from http://principlesofchaos.org/
Lorin Hochstein
Chaos Engineering @ Netflix
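A toy experiment in that spirit, with the injected fault and the success threshold as illustrative assumptions: state a steady-state hypothesis, inject turbulence, and verify the hypothesis still holds.

import random

def call_dependency(inject_fault=False):
    if inject_fault and random.random() < 0.3:
        raise TimeoutError("injected dependency timeout")
    return "ok"

def call_with_fallback(inject_fault):
    # The resilience mechanism under test: degrade to a cache, don't fail.
    try:
        return call_dependency(inject_fault)
    except TimeoutError:
        return "cached-fallback"

def steady_state(inject_fault, n=1000):
    # Steady state: fraction of requests that were served an answer.
    served = sum(call_with_fallback(inject_fault) in ("ok", "cached-fallback")
                 for _ in range(n))
    return served / n

assert steady_state(inject_fault=False) == 1.0   # baseline is healthy
assert steady_state(inject_fault=True) >= 0.99   # hypothesis survives the faults
print("steady state held under injected dependency timeouts")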
48. Monitoring the Monitoring
The first thing that would be useful is to have a
monitoring system that has failure modes which are
uncorrelated with the infrastructure it is monitoring. For
efficiency it is common to co-locate a monitoring system
with the infrastructure, in the same datacenter or cloud
region, but that sets up common dependencies that
could cause both to fail together.
Adrian Cockcroft
VP Cloud Architecture @ AWS
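One pattern that follows from this is a "dead man's switch": an independent watchdog, run from a different region or provider than the monitoring stack it watches, that pages when the monitoring system itself stops answering. A minimal sketch, with the health URL and the paging hook as placeholder assumptions:

import time
import urllib.request

MONITORING_HEALTH_URL = "https://monitoring.example.com/-/healthy"  # placeholder

def page_oncall(reason):
    print("PAGE:", reason)   # stand-in for a real paging integration

failures = 0
while True:
    try:
        with urllib.request.urlopen(MONITORING_HEALTH_URL, timeout=5) as resp:
            failures = 0 if resp.status == 200 else failures + 1
    except Exception:
        failures += 1
    if failures >= 3:        # tolerate transient blips before paging
        page_oncall("monitoring system unreachable from independent watchdog")
    time.sleep(60)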