1) SRE teams are responsible for maintaining service reliability, availability, performance and efficiency through monitoring, on-call response, and capacity planning.
2) Key metrics (SLIs) are tracked and compared to targets (SLOs) with error budgets accounting for differences. Metrics should be focused on what matters to users and enable meaningful SLIs.
3) Recurring, manual tasks that don't provide long-term benefits (toil) should be minimized through automation, distributing work evenly, and promptly addressing issues to free up engineers' time for more meaningful work.
2. About me
Consultant at NETWAYS since 2021
Working with technologies like
Kubernetes
Ansible
Prometheus
Grafana
Interested in DevOps and SRE practices
3. Site
In the beginning, the site was google.com
nowadays, an arbitrary service
Maps
Mail
Ads
provided by an SRE team
consumed by users
4. Reliability
The ability of a service to perform as expected upon request
This is not the same as availability
A service can be available but not reliable
high latency
high error rates
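The gap between availability and reliability can be made concrete with a small sketch. The `Request` record, the 300 ms latency threshold, and the rule that status codes of 500 and above count as errors are assumptions chosen for illustration, not definitions from the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    answered: bool     # did the service respond at all?
    status: int        # HTTP-style status code (hypothetical field)
    latency_ms: float  # time taken to respond


def availability(requests):
    """Share of requests that got any response."""
    return sum(1 for r in requests if r.answered) / len(requests)


def reliability(requests, max_latency_ms=300.0):
    """Share of requests answered as expected: no server error, fast enough."""
    good = sum(
        1 for r in requests
        if r.answered and r.status < 500 and r.latency_ms <= max_latency_ms
    )
    return good / len(requests)


sample = [
    Request(True, 200, 80.0),
    Request(True, 200, 120.0),
    Request(True, 503, 40.0),    # answered, but with an error
    Request(True, 200, 2500.0),  # answered, but far too slowly
]
print(availability(sample))  # 1.0
print(reliability(sample))   # 0.5
```

Every request in the sample is answered, so the service looks fully available, yet only half of the requests are served as a user would expect.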
5. Engineering
Members of SRE teams are…
⛔ classical system administrators
⛔ mainly systems engineers
✅ software engineers, first and foremost
“[At Google,] Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems”
Benjamin Treynor Sloss, Vice President, Google Engineering, founder of Google SRE
9. SLIs, SLOs, and Error Budgets
SLIs (Service Level Indicators) are key metrics for your service(s)
SLOs (Service Level Objectives) are targets for your SLIs
Error Budgets are the difference between your SLOs and your actual SLIs
Photo by Emil Kalibradov on Unsplash
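How the three terms relate can be sketched with a little arithmetic. This is a minimal sketch, assuming a request-based SLI (good events divided by total events) and reading the total budget as the unreliability the SLO permits, i.e. 1 - SLO. The function names and the sample numbers are hypothetical.

```python
def sli(good_events, total_events):
    """Measured SLI: share of events that were good."""
    return good_events / total_events


def error_budget(slo):
    """Share of events that may fail without violating the SLO."""
    return 1.0 - slo


def sli_headroom(slo, good_events, total_events):
    """Gap between the measured SLI and the SLO (negative: SLO violated)."""
    return sli(good_events, total_events) - slo


def budget_remaining(slo, good_events, total_events):
    """Unspent share of the error budget: 1.0 is untouched, below 0 is overspent."""
    allowed_bad = error_budget(slo) * total_events
    if allowed_bad <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Assumed example: 99.9% SLO, one million requests, 500 of them bad.
slo = 0.999
print(f"{budget_remaining(slo, 999_500, 1_000_000):.0%}")  # 50%
```

With these numbers the SLO allows roughly 1,000 bad requests, 500 have occurred, and about half of the budget is left.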
10. Metrics Collection
Looking at all metrics all the time is not feasible
Focus on metrics that
matter for your end users
can be used to forge meaningful SLIs
KISS - Keep It Short and Simple!
Photo by Tim Mossholder on Unsplash
11. SLI Generalization
Generalize SLIs:
Choose sane defaults for aggregation:
intervals (e.g. 5 minutes)
methods (e.g. average)
regions (e.g. cluster-wide)
resolutions (e.g. 10 seconds)
DRY - Don’t Repeat Yourself!
Photo by Stephen Phillips on Unsplash
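The aggregation defaults above can be sketched as one shared helper that every SLI reuses. This is a minimal sketch under stated assumptions: samples arrive as `(unix_timestamp_seconds, value)` pairs, the constant and function names are hypothetical, and the region dimension is left out for brevity.

```python
from collections import defaultdict
from statistics import mean

# Shared defaults, defined once and reused by every SLI (DRY).
RESOLUTION_S = 10  # one raw sample every 10 seconds
INTERVAL_S = 300   # aggregate into 5-minute windows


def aggregate(samples, interval_s=INTERVAL_S, method=mean):
    """Group (timestamp, value) samples into fixed windows and aggregate each."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        window_start = int(ts) - int(ts) % interval_s
        buckets[window_start].append(value)
    return {start: method(values) for start, values in sorted(buckets.items())}


# Ten minutes of hypothetical latency samples at the default resolution.
samples = [(t, 100.0) for t in range(0, 300, RESOLUTION_S)]
samples += [(t, 200.0) for t in range(300, 600, RESOLUTION_S)]
print(aggregate(samples))  # {0: 100.0, 300: 200.0}
```

An SLI that needs a different method can pass it explicitly, for example `aggregate(samples, method=max)`, while all others inherit the defaults.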
13. Identifying Toil
Recurring, boring tasks that don’t lead to long-term benefits:
🗑️ Manually restarting services
🗑️ Manually executing scripts
♻️ Handling pager alerts (for the first time)
♻️ Refactoring code to reduce technical debt
Photo by the blowup on Unsplash
14. Managing Toil
Distribute it evenly across the team
Do it immediately
Chip away at it, week by week
Photo by Luis Villasmil on Unsplash
15. Minimizing Toil
Aim for
automatic, not merely automated, systems and solutions
maintenance effort that scales sublinearly, below O(n), with service size
a shared mindset that some toil is inevitable, but too much is unacceptable
Photo by Gary Chan on Unsplash
17. On-Call is Toil
On-call time sets a natural lower bound on toil: it is toil that can’t be reduced.
Be careful with introducing/allowing additional toil.
18. Balance is Key
At Google, two incidents per shift are seen as a good balance, leaving enough time for proper handling and postmortems.
More incidents, and handling becomes hasty
Fewer incidents, and the on-call engineer’s time gets wasted
Photo by Alexander Andrews on Unsplash
19. Keep in mind
Let your SREs engineer, not just operate
Stay on top of your SLIs, SLOs, Error Budgets, and Toil
Act accordingly
Staff your SRE team(s) appropriately
Photo by Diego PH on Unsplash
20. There’s so much more
Release Engineering
Engineering for Automation
Incident Management
… and much more. Maybe another time!
Photo by Hadija on Unsplash