Pierre Vincent gives a presentation on increasing the visibility of distributed systems in production. He discusses the hierarchy of service reliability, designing systems to recover from failure, and how distributing a system also distributes the places where failures can occur. He covers strategies for health checks and for collecting system, application, and business metrics using tools like Prometheus. Vincent emphasizes making metrics and logs usable and limiting alerts to user-impacting issues. Tracing is discussed as a way to correlate errors across services. In closing, Vincent notes that visibility enables operability and justifiable decisions, and builds trust in systems.
12. Usability of metrics tooling is key to adoption
Instrument code → Query metrics → Create dashboards → Define rules & thresholds
13. Limit alerting to user-impacting symptoms
Expose dashboards to diagnose causes
14. Overlaying changes with production metrics
Source: Ian Malpass (Etsy), Measure Anything, Measure Everything
https://codeascraft.com/2011/02/15/measure-anything-measure-everything
15. Making sense of logs
Centralise logs · Common searchable format · Correlation IDs
16. Tracing
[Diagram: a request fans out across services A–J, all tagged with the same trace id a1b2c3]
ERROR [svc=H][trace=a1b2c3] Failed to save order
Cause: Cassandra timeout exception
ERROR [svc=F][trace=a1b2c3] Failed to complete order
Cause: Shipping service responded with 500
ERROR [svc=A][trace=a1b2c3] Failed to process order
Cause: Order process manager responded with 500
INFO [svc=G][trace=a1b2c3] Items verified in stock
21. "If you can't monitor a service, you don't know what's happening, and if you're blind to what's happening, you can't be reliable."
N. Murphy, J. Petoff, C. Jones, B. Beyer, Site Reliability Engineering
Maslow's hierarchy of needs: food > safety > love > esteem > fulfilment
Reliability:
monitoring: see how things are working and get notified when they're not
incident response: once we're notified, how do we mitigate (turn off a feature / add capacity)
postmortem/root-cause analysis: what went wrong, how do we fix it durably
testing/release procedures: test what tends to go wrong, to catch things before they ship
capacity planning: understanding load, dynamically balancing load, circuit breaking etc.
development: design the system for reliability, knowing where things tend to be brittle
product: the fulfilment of a reliable product
“Monitoring enables service owners to make rational decisions about the impact of changes to the service, apply the scientific method to incident response, and of course ensure their reason for existence: to measure the service’s alignment with business goals”
We used to spend most of our time coding and not testing; then came TDD, and now not unit testing is the widely agreed outlier
But are we still spending most of our time developing?
Apps that haven't reached production = just playing around
Production is the real deal, but we treat it as the finish line, when it's actually the opposite
Things will never run perfectly.
If nothing goes wrong in your system either:
Nobody is using it
You just don’t know about it
There is only so much we can think about.
Diminishing returns in designing/coding perfection - much more value for money in admitting that things will go wrong, and that in these cases our focus is to:
Know about it asap
Have as much info as possible to find the source of the problem
What I mean by distributed systems:
No longer 1 web app supported by 1 db
A number of separate parts, responsible for different things, talking to each other over the network
Independently scalable, independently deployable, independently failing
Clusters of databases, message queues
Multiple servers, DCs, clouds
Nobody knew distributing would be so complicated! ;)
Strategies for health checks:
Simple network broadcast
Registration
Heartbeat in an HA store (etcd, ZooKeeper…)
Expose health to others, e.g. an HTTP endpoint (see the sketch after this list)
Requires some form of service discovery
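A minimal sketch of the HTTP-endpoint approach, using only Python's standard library; the /health path and the check_database dependency check are illustrative, not from the talk:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def check_database() -> bool:
    # Hypothetical dependency check; replace with a real ping/query.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            healthy = check_database()
            body = json.dumps({"status": "UP" if healthy else "DOWN"}).encode()
            # 200 when healthy, 503 when not, so callers can rely on the
            # status code alone.
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```

A service-discovery layer or load balancer can then poll this endpoint and route traffic away from unhealthy instances.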
All levels of metrics are important
Different teams might be responsible for different things
Not exclusive levels
Need ability to correlate different levels
If adding metrics is simple, every developer will do it
One-line instrumentation with tools like Prometheus or StatsD (see the sketch below)
Integration with graphing tools, alerting tools
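As a rough illustration of how little code that one line can be, here is a sketch using the official prometheus_client Python library; the metric names and the /order endpoint label are made up for the example:

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# System/application-level metric: request latency per endpoint.
REQUEST_LATENCY = Histogram("http_request_duration_seconds",
                            "Request latency", ["endpoint"])
# Business-level metric; exposed by the client as orders_placed_total.
ORDERS_PLACED = Counter("orders_placed", "Orders placed")

@REQUEST_LATENCY.labels(endpoint="/order").time()  # the one-line instrumentation
def handle_order():
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    ORDERS_PLACED.inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_order()
```

The same metric names then drive the dashboards and alerting rules downstream, which is what makes the low-friction instrumentation worth it.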
Not going to expand on alerting - entire (multiple) talks required!
Alerting on symptoms reduces noise
> 1st action is to mitigate effects, then track down the cause
Use dashboards to troubleshoot
> 1st place to go to validate theories
Problems are mostly caused by changes
Overlay production changes on time-series graphs
Deployment/config change > correlate with changes in system behaviour (see the sketch below)
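One common way to make that overlay possible (an assumption here, not something the talk prescribes) is to publish an info-style metric at deploy time, so graphing tools can draw a marker wherever the label set changes:

```python
import time
from prometheus_client import Gauge, start_http_server

# Info-style gauge: the value is always 1; the information lives in the labels.
DEPLOY_INFO = Gauge("app_deployment_info", "Currently deployed version",
                    ["version", "git_sha"])

def mark_deployment(version: str, git_sha: str) -> None:
    # Call once at startup. On a dashboard, the point where the label set
    # changes is the moment a new version went live.
    DEPLOY_INFO.labels(version=version, git_sha=git_sha).set(1)

if __name__ == "__main__":
    start_http_server(8001)               # expose /metrics for scraping
    mark_deployment("1.4.2", "a1b2c3d")   # hypothetical version identifiers
    time.sleep(300)                       # keep the demo process alive
```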
Aggregated logs are just more logs in one place
Need to make sense of them
Correlation IDs, tracing (see the logging sketch after these notes)
Search for a trace / timing of traces
This is profiling on a live environment!
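A minimal sketch of propagating a correlation id through log lines, matching the [svc=…][trace=…] format from the tracing slide; the service name and the header convention are assumptions:

```python
import logging
import uuid

# Format mirrors the slide's log lines: LEVEL [svc=…][trace=…] message
logging.basicConfig(
    format="%(levelname)s [svc=%(svc)s][trace=%(trace)s] %(message)s",
    level=logging.INFO)
log = logging.getLogger("orders")

def handle_request(trace_id: str | None = None) -> None:
    # Reuse the caller's id if one arrived (e.g. in an X-Request-ID header);
    # mint a new one at the edge of the system otherwise.
    trace_id = trace_id or uuid.uuid4().hex[:6]
    extra = {"svc": "F", "trace": trace_id}
    log.info("Processing order", extra=extra)
    # ... pass trace_id downstream on every outgoing call ...
    log.error("Failed to complete order", extra=extra)

handle_request("a1b2c3")
```

Because every service stamps the same id on its lines, a single search in the centralised logs reconstructs the whole request path.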
Example of a DNS issue tracked down:
- Error rate of a peer dependency went up
- Tracked down to a breach of our SLO on the API
- Requests to a particular dependency were slow, but there was no evidence of that dependency being slow to respond
- Monitoring disproved the theory that the dependency was slow to respond
- Pointed at something between the 2 services
- Added internal Zipkin tracing inside the calling service
- Tracked it down to slow DNS lookups caused by a bad resolv configuration
Having fuller picture = less guess work
Impossible to reason about a system when flying blind
Monitoring lets us take a scientific approach to explaining production systems
> find evidence of problems
> make hypotheses on issues
> correlate issues with recent changes
> prove/disprove hypotheses
Shining a light on your system gives you the real picture
Internal changes backed only by guesswork are an anti-pattern
They make things more complicated without evidence to back them up
No way to quantify whether things got better or worse
Encourage “information radiators”
= No hiding (from others and from ourselves)
> Needs a culture of safety
Distilled dashboards and status pages for other parts of the business
= spread visibility to those higher up (e.g. support)
= build confidence and trust of stakeholders