Metrics, Margins, Mutiny
Wednesday, Nov 8 2023
How to make your SREs (not) run away
@d_bodky
About me
Consultant at NETWAYS since 2021
Working with technologies like
Kubernetes
Ansible
Prometheus
Grafana
Interested in DevOps and SRE practices
Site
In the beginning, the site was google.com
nowadays, an arbitrary service
Maps
Mail
Ads
provided by an SRE team
consumed by users
Reliability
The ability of a service to perform as expected upon request
This is not the same as availability
A service can be available but not reliable, for example because of
high latency
high error rates
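The distinction above can be made concrete with a small sketch. It assumes a hypothetical `Request` record and an illustrative latency threshold of 300 ms; neither comes from the talk. A request counts as "good" only if it was answered without a server error and fast enough, so a service that answers every request can still score poorly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record, for illustration only."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request is 'good' only if it succeeded AND was fast enough."""
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def reliability(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of requests that performed as expected (1.0 if there was no traffic)."""
    if not requests:
        return 1.0
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests)


# Every request was answered (the service was "available"),
# yet half of them were too slow or failed.
sample = [
    Request(200, 120.0),
    Request(200, 950.0),  # answered, but too slow
    Request(503, 40.0),   # answered, but with a server error
    Request(200, 80.0),
]
print(reliability(sample))
```

The threshold is the part worth debating with your users: it encodes what "as expected" means for the service in question.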
Engineering
Members of SRE teams are…
⛔ classical system administrators
⛔ mainly systems engineers
✅ software engineers, first and foremost
“[At Google,] Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems”
Benjamin Treynor Sloss, Vice President, Google Engineering, founder of Google SRE
Responsibilities
Availability
Latency
Performance
Efficiency
Change Management
Monitoring
Emergency Response
Capacity Planning
SLIs, SLOs, and Error Budgets
SLIs, SLOs, and Error Budgets
SLIs (Service Level Indicators) are key metrics for your service(s)
SLOs (Service Level Objectives) are targets for your SLIs
Error Budgets are the difference between your SLOs and your actual SLIs
Photo by Emil Kalibradov on Unsplash
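As a rough sketch of how the three terms relate, the snippet below treats the SLI as a ratio of good events to total events and the SLO as a target for that ratio. The function names and the sample numbers are assumptions made for illustration, not part of the talk.

```python
def error_budget(slo_target: float) -> float:
    """Allowed fraction of bad events, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_events <= 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = error_budget(slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers: a 99.9% SLO over one million requests allows
# roughly 1000 bad requests; 500 bad requests spend about half the budget.
print(budget_remaining(0.999, good_events=999_500, total_events=1_000_000))
```

A negative result means the SLO has been missed for the period, which is usually the signal to slow down risky changes.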
Metrics Collection
Looking at all metrics all the time is not feasible
Focus on metrics that
matter for your end users
can be used to forge meaningful SLIs
KISS - Keep It Short and Simple!
Photo by Tim Mossholder on Unsplash
SLI Generalization
Generalize SLIs:
Choose sane defaults for aggregation:
intervals (e.g. 5 minutes)
methods (e.g. average)
regions (e.g. cluster-wide)
resolutions (e.g. 10 seconds)
DRY - Don’t Repeat Yourself!
Photo by Stephen Phillips on Unsplash
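One way to keep such defaults in a single place is a small configuration object that every SLI reuses. The sketch below is an assumption about how this could look; `SLIDefaults` and `aggregate` are hypothetical names, and the values mirror the examples on the slide.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class SLIDefaults:
    """Shared aggregation defaults (hypothetical structure)."""
    interval_s: int = 300    # aggregation interval: 5 minutes
    resolution_s: int = 10   # one raw sample every 10 seconds
    method: Callable[[Sequence[float]], float] = mean  # aggregation method: average
    region: str = "cluster"  # scope label only; not used in the arithmetic below


DEFAULTS = SLIDefaults()


def aggregate(samples: Sequence[float], cfg: SLIDefaults = DEFAULTS) -> list[float]:
    """Collapse raw samples into one value per aggregation interval."""
    per_window = cfg.interval_s // cfg.resolution_s  # 30 samples per window by default
    return [
        cfg.method(samples[i:i + per_window])
        for i in range(0, len(samples), per_window)
    ]


# Ten minutes of samples at 10-second resolution give two 5-minute windows.
samples = [0.0] * 15 + [2.0] * 15 + [3.0] * 30
print(aggregate(samples))                           # averages per window
print(aggregate(samples, SLIDefaults(method=max)))  # override only what differs
```

Individual SLIs then override a single field instead of restating every setting, which is the DRY point of the slide.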
Dealing with Toil
Identifying Toil
Recurring, boring tasks that don’t lead to long-term benefits:
🗑️ Manually restarting services
🗑️ Manually executing scripts
♻️ Handling pager alerts (for the first time)
♻️ Refactoring code to reduce technical debt
Photo by the blowup on Unsplash
Managing Toil
Distribute it evenly across the team
Do it immediately
Chip away at it, week by week
Photo by Luis Villasmil on Unsplash
Minimizing Toil
Aim for
automatic, not merely automated, systems and solutions
maintenance effort that scales sublinearly with service size (< O(n))
a shared mindset that some toil is inevitable, but too much is unacceptable
Photo by Gary Chan on Unsplash
Being On-Call
On-Call is Toil
On-Call time is a natural lower bound to the amount of toil that can’t be reduced.
Be careful with introducing/allowing additional toil.
Balance is Key
At Google, two incidents per shift are seen as a good balance, leaving enough time for proper handling and postmortems.
More incidents, and handling becomes hasty
Fewer incidents, and the on-call engineer’s time gets wasted
Photo by Alexander Andrews on Unsplash
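A team could check its own rotation against that figure with something like the sketch below. The target of two incidents per shift is the number quoted on the slide; the tolerance of 0.5 and the function name are assumptions made for illustration.

```python
from statistics import mean

TARGET_INCIDENTS_PER_SHIFT = 2  # figure quoted on the slide for Google


def assess_on_call_load(
    incidents_per_shift: list[int],
    target: float = TARGET_INCIDENTS_PER_SHIFT,
    tolerance: float = 0.5,  # assumed slack around the target
) -> str:
    """Compare the average number of incidents per shift against the target."""
    if not incidents_per_shift:
        return "no data"
    avg = mean(incidents_per_shift)
    if avg > target + tolerance:
        return "overloaded"  # handling and postmortems are likely rushed
    if avg < target - tolerance:
        return "underused"   # on-call time is largely idle
    return "balanced"


print(assess_on_call_load([2, 3, 1, 2]))
print(assess_on_call_load([5, 4, 6]))
print(assess_on_call_load([0, 1, 0]))
```

An average hides outliers, so a single very bad shift may still deserve a closer look even when the result reads "balanced".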
Keep in mind
Let your SREs engineer, not just operate
Stay on top of your SLIs, SLOs, Error Budgets, and Toil
Act accordingly
Staff your SRE team(s) appropriately
Photo by Diego PH on Unsplash
There’s so much more
Release Engineering
Engineering for Automation
Incident Management
… and much more. Maybe another time!
Photo by Hadija on Unsplash

OSMC 2023 | IGNITE: Metrics, Margins, Mutiny – How to make your SREs (not) run away by Daniel Bodky
