OSMC 2023 | IGNITE: Metrics, Margins, Mutiny – How to make your SREs (not) run away by Daniel Bodky


1) SRE teams are responsible for maintaining service reliability, availability, performance and efficiency through monitoring, on-call response, and capacity planning.
2) Key metrics (SLIs) are tracked and compared to targets (SLOs) with error budgets accounting for differences. Metrics should be focused on what matters to users and enable meaningful SLIs.
3) Recurring, manual tasks that don't provide long-term benefits (toil) should be minimized through automation, distributing work evenly, and promptly addressing issues to free up engineers' time for more meaningful work.


Metrics, Margins, Mutiny
Wednesday, Nov 8 2023
How to make your SREs (not) run away
@d_bodky

About me
Consultant at NETWAYS since 2021
Working with technologies like
Kubernetes
Ansible
Prometheus
Grafana
Interested in DevOps and SRE practices

Site
In the beginning, the site: google.com
Nowadays, an arbitrary service
Maps
Mail
Ads
provided by an SRE team
consumed by users

Reliability
The ability of a service to perform as expected upon request
This is not the same as availability
A service can be available but not reliable
high latency
high error rates
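
The distinction above can be made concrete with a small sketch. It assumes a hypothetical `Request` record and a made-up latency threshold of 300 ms; neither comes from the talk. Availability counts every answered request, while reliability only counts requests that were answered without a server error and fast enough.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record, for illustration only."""
    answered: bool      # did the service respond at all?
    status: int         # HTTP-style status code
    latency_ms: float   # time until the response arrived


def availability(requests):
    """Fraction of requests that received any response."""
    return sum(r.answered for r in requests) / len(requests)


def reliability(requests, max_latency_ms=300.0):
    """Fraction of requests served as expected.

    "As expected" is assumed here to mean: answered, no 5xx status,
    and latency within max_latency_ms (an illustrative threshold).
    """
    good = sum(
        r.answered and r.status < 500 and r.latency_ms <= max_latency_ms
        for r in requests
    )
    return good / len(requests)


sample = [
    Request(True, 200, 120.0),
    Request(True, 200, 950.0),  # answered, but too slow
    Request(True, 503, 80.0),   # answered, but with an error
    Request(True, 200, 140.0),
]

print(availability(sample))  # every request was answered
print(reliability(sample))   # only half were served as expected
```

In this sample the service is fully available, yet only half of the requests were served as expected.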

Engineering
Members of SRE teams are…
⛔ classical system administrators
⛔ mainly systems engineers
✅ software engineers, first and foremost

“[At Google,] Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems”
Benjamin Treynor Sloss, Vice President, Google Engineering, founder of Google SRE

Responsibilities
Availability
Latency
Performance
Efficiency
Change Management
Monitoring
Emergency Response
Capacity Planning

SLIs, SLOs, and Error Budgets
SLIs (Service Level Indicators) are key metrics for your service(s)
SLOs (Service Level Objectives) are targets for your SLIs
Error Budgets are the difference between your SLOs and your actual SLIs
Photo by Emil Kalibradov on Unsplash
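
A minimal sketch of how these three terms relate, under one common reading that is an assumption here rather than something the slide spells out: the SLO is a target fraction of good events, the error budget is the fraction of events allowed to be bad (1 minus the SLO), and the remaining budget is what is left after the failures actually measured by the SLI. The function names are hypothetical.

```python
def error_budget(slo):
    """Allowed fraction of bad events for an SLO, e.g. 0.999 gives about 0.001."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def remaining_budget(slo, good_events, total_events):
    """Share of the error budget still unspent; negative when overspent."""
    allowed_bad = error_budget(slo) * total_events
    actual_bad = total_events - good_events
    return (allowed_bad - actual_bad) / allowed_bad


# With a 99.9% SLO, 1,000,000 requests may include about 1,000 bad ones.
# 500 bad requests so far leaves roughly half of the budget.
print(remaining_budget(0.999, good_events=999_500, total_events=1_000_000))
```

A negative result means the SLO has already been missed for the period under review.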

Metrics Collection
Looking at all metrics all the time is not feasible
Focus on metrics that
matter for your end users
can be used to forge meaningful SLIs
KISS - Keep It Short and Simple!
Photo by Tim Mossholder on Unsplash

SLI Generalization
Generalize SLIs:
Choose sane defaults for aggregation:
intervals (e.g. 5 minutes)
methods (e.g. average)
regions (e.g. cluster-wide)
resolutions (e.g. 10 seconds)
DRY - Don’t Repeat Yourself!
Photo by Stephen Phillips on Unsplash
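
One way to apply the defaults listed above is to define them once and reuse them for every SLI. The sketch below is illustrative: `AggregationDefaults` and `aggregate` are hypothetical names, and only the "average" method is implemented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AggregationDefaults:
    """Shared defaults so each SLI does not redefine them (DRY)."""
    interval_s: int = 300     # aggregation interval: 5 minutes
    resolution_s: int = 10    # one raw sample every 10 seconds
    method: str = "average"   # aggregation method
    scope: str = "cluster"    # region: cluster-wide


DEFAULTS = AggregationDefaults()


def aggregate(samples, cfg=DEFAULTS):
    """Collapse evenly spaced raw samples into one value per interval.

    A trailing partial interval is averaged over the samples it has.
    """
    if cfg.method != "average":
        raise ValueError(f"unsupported method: {cfg.method}")
    per_window = cfg.interval_s // cfg.resolution_s  # 30 samples per window
    return [
        mean(samples[i:i + per_window])
        for i in range(0, len(samples), per_window)
    ]


# Ten minutes of latency samples: 100 ms for five minutes, then 200 ms.
latencies_ms = [100.0] * 30 + [200.0] * 30
print(aggregate(latencies_ms))
```

Every SLI that calls `aggregate` without its own configuration then shares the same interval, method, and resolution.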

Dealing with Toil

Identifying Toil
Recurring, boring tasks that don’t lead to long-term benefits:
🗑️ Manually restarting services
🗑️ Manually executing scripts
♻️ Handling pager alerts (for the first time)
♻️ Refactoring code to reduce technical debt
Photo by the blowup on Unsplash

Managing Toil
Distribute it evenly across the team
Do it immediately
Chip away at it, week by week
Photo by Luis Villasmil on Unsplash

Minimizing Toil
Aim for
automatic, not automated systems and solutions
maintenance effort that grows more slowly than the service itself, i.e. below O(n)
a shared mindset that some toil is inevitable, but too much is unacceptable
Photo by Gary Chan on Unsplash

Being On-Call

On-Call is Toil
On-Call time is a natural lower bound to the amount of toil that can’t be reduced.
Be careful with introducing/allowing additional toil.

Balance is Key
At Google, two incidents per shift are seen as a good balance, leaving enough time for proper handling and postmortems.
More incidents, and handling becomes hasty
Fewer incidents, and the on-call engineer’s time gets wasted
Photo by Alexander Andrews on Unsplash
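
The rule of thumb above can be turned into a simple check on a rotation's incident history. This is only a sketch: the target of two incidents per shift is the figure quoted for Google, while the `tolerance` band and the function name are assumptions made for illustration.

```python
from statistics import mean


def assess_rotation(incidents_per_shift, target=2.0, tolerance=0.5):
    """Classify the average on-call load across several shifts.

    target:    incidents per shift seen as a good balance
    tolerance: assumed band around the target still counted as balanced
    """
    average = mean(incidents_per_shift)
    if average > target + tolerance:
        return "overloaded"  # handling and postmortems become hasty
    if average < target - tolerance:
        return "underused"   # the on-call engineer's time gets wasted
    return "balanced"


print(assess_rotation([1, 2, 3, 2]))  # averages 2 incidents per shift
print(assess_rotation([4, 5, 3]))     # averages 4 incidents per shift
print(assess_rotation([0, 1, 0]))     # averages well below the target
```

Looking at an average over several shifts, rather than a single shift, keeps one unusually quiet or busy night from skewing the result.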

Keep in mind
Let your SREs engineer, not just operate
Stay on top of your SLIs, SLOs, Error Budgets, and Toil
Act accordingly
Staff your SRE team(s) appropriately
Photo by Diego PH on Unsplash

There’s so much more
Release Engineering
Engineering for Automation
Incident Management
… and much more. Maybe another time!
Photo by Hadija on Unsplash
