3. "What happens when a software engineer is tasked with what used to
be called operations."
» Ben Treynor Sloss, Vice President, Google Engineering,
founder of Google SRE
11. No team separation
» Bounded context, but ...
» Shared ownership of platform - shared responsibility
» Shared tooling (debug, deploy, monitor)
» Shared codebase
» Brainstorm
» Motivation for good design (monitoring, future debugging)
12. Things break
» They do - better be ready
» Knowing when there's a problem (logs, metrics, alerting)
» Having someone there - being oncall
» Responding (mitigation, resolution)
» Learning from it (postmortems)
13. Measure everything
» No gut feeling when we have the data (app metrics, runtime
metrics)
» Both production and non-production systems (e.g. our CI test
time)
» Thresholds, automated alerting
» Visualize the data (oncall dashboard, happiness dashboard)
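A minimal sketch of threshold-based alerting, assuming a CI test time metric measured in seconds. The metric name, the `Threshold` type, the `evaluate` helper, and the threshold values are all hypothetical and only illustrate the idea of replacing gut feeling with a measured value and an automated check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (hypothetical type)."""
    metric: str
    warn: float
    critical: float


def evaluate(threshold: Threshold, value: float) -> Optional[str]:
    """Return an alert severity for a measured value, or None if healthy."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warn:
        return "warning"
    return None


# Assumed limits for a non-production metric: CI test time in seconds.
ci_test_time = Threshold(metric="ci_test_time_seconds", warn=600, critical=900)

for sample in (420.0, 650.0, 1200.0):
    severity = evaluate(ci_test_time, sample)
    if severity is not None:
        print(f"{ci_test_time.metric}={sample} -> {severity}")
```

In a real setup the samples would come from the metrics system and the severity would feed the paging tool rather than `print`.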
18. Reason 1: Decreasing chances of errors
» Source and great post: http://www.devops.ch/2017/05/10/devops-explained/
19. Reason 2: Eliminating toil, work that is:
» Repetitive
» Automatable
» Doesn't provide enduring value
» Scales linearly with service
» Compounds significantly and surprisingly
20. Reason 3: Focusing on creative
engineering work that:
» Improves reliability
» Improves performance & stability of systems
» Ensures scalability
» Reduces toil
» Is fun: improves morale, speeds up progress, allows skill
development
22. Being oncall
» Shared among developers (roles, not individuals; increases bus factor)
» Responsible for the platform
» Safety net - you know who to call
» Runbooks - you know what to do
» Early alerting - proactively investigate
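One way to picture "roles, not individuals" is a rotation that maps a date to a primary and a secondary role. This is only a sketch: the names, the weekly shift length, and the `oncall_for` helper are assumptions, not the team's actual schedule.

```python
from datetime import date
from typing import Dict, List


def oncall_for(day: date, rotation: List[str], start: date,
               shift_days: int = 7) -> Dict[str, str]:
    """Return who holds the primary and secondary oncall roles on a given day.

    The secondary is the next person in the rotation, so there is always
    a known person to call (the safety net).
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    shift = (day - start).days // shift_days
    return {
        "primary": rotation[shift % len(rotation)],
        "secondary": rotation[(shift + 1) % len(rotation)],
    }


# Hypothetical developers and start date.
rotation = ["alice", "bob", "carol"]
start = date(2024, 1, 1)

print(oncall_for(date(2024, 1, 10), rotation, start))
```

Because the role moves through the whole list, every developer ends up holding it, which is what raises the bus factor.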
23. Incident response
If critical: incident commander role
Separate roles, if necessary:
» Outbound and inbound communication
» Root cause analysis
» Issue mitigation
Tracking time (incident ack expiration) and keeping track
Tooling (alerts, paging, postmortem reminders)
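Incident ack expiration can be sketched as a check that re-pages when an acknowledgement has gone stale. The `Incident` type, the `needs_repage` helper, and the 30-minute expiry are assumptions for illustration; real paging tools expose this as configuration rather than code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Minimal incident record (hypothetical fields)."""
    opened_at: datetime
    acked_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def needs_repage(incident: Incident, now: datetime,
                 ack_ttl: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if someone should be paged (again) for this incident."""
    if incident.resolved_at is not None:
        return False          # resolved incidents never page
    if incident.acked_at is None:
        return True           # nobody has acknowledged yet
    # The ack expires after ack_ttl, so an unresolved incident pages again.
    return now - incident.acked_at >= ack_ttl


opened = datetime(2024, 1, 1, 12, 0)
incident = Incident(opened_at=opened, acked_at=opened + timedelta(minutes=5))

print(needs_repage(incident, opened + timedelta(minutes=10)))
print(needs_repage(incident, opened + timedelta(minutes=40)))
```

The expiry keeps an acknowledged but unresolved incident from being silently forgotten.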