SRE Demystified
Eliminate Toil
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer
SRE
https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257
Toil
• Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows
https://landing.google.com/sre/workbook/chapters/eliminating-toil/
What is NOT toil?
• Toil is not just "work I don't like to do"
• It is also not simply equivalent to administrative chores or grungy work
• Some administrative chores have to get done but should not be categorized as toil: this is overhead
• Overhead includes tasks like team meetings, setting goals, and HR paperwork
• Cleaning up the entire alerting configuration for your service and removing clutter may be grungy, but it is not toil
https://landing.google.com/sre/workbook/chapters/eliminating-toil/
Toil Defined
• Manual: manually running a script (the time spent running the script counts as toil)
• Repetitive: toil is work you do over and over
• Automatable: if a machine could accomplish the task just as well as a human
• Tactical: handling pager alerts
• No enduring value: if your service remains in the same state after you have finished a task, the task was probably toil
• O(n) with service growth: if the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil
https://landing.google.com/sre/workbook/chapters/eliminating-toil/
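The six traits above can be read as a checklist. The following is a minimal sketch of that idea, not a method from the slides: the `ToilTraits` class, the `toil_score` function, and the threshold of four traits are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToilTraits:
    """One flag per trait from the slide; all names are illustrative."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool  # O(n) with service growth


def toil_score(traits: ToilTraits) -> int:
    """Count how many of the six traits apply to a task."""
    return sum(getattr(traits, f.name) for f in fields(traits))


def is_probably_toil(traits: ToilTraits, threshold: int = 4) -> bool:
    """Assumed rule of thumb: flag a task once `threshold` traits apply."""
    return toil_score(traits) >= threshold


# Handling quota requests: matches every trait.
quota_request = ToilTraits(True, True, True, True, True, True)
# One-off cleanup of alerting config: manual and grungy, but not toil.
alert_cleanup = ToilTraits(True, False, False, False, False, False)

print(toil_score(quota_request), is_probably_toil(quota_request))
print(toil_score(alert_cleanup), is_probably_toil(alert_cleanup))
```

The threshold is a tunable guess; the point is only that the more traits a task matches, the more likely it is to be toil.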
Examples
• Handling quota requests
• Applying database schema changes
• Reviewing non-critical monitoring alerts
• Copying and pasting commands from a playbook
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
https://www.rundeck.com/blog/sre-anti-pattern-known-workaround-bug-closed
Measuring the impact of the work
• What type of work was it (quota changes, push release to production, ACL update, etc.)?
• What was the degree of difficulty: Easy (<1 hour), Medium (hours), or Hard (days)? Base this on human hands-on time, not elapsed time.
• Who did the work?
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
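One way to capture these three questions is a small record per piece of work. This is a sketch under stated assumptions: the `ToilEntry` and `Difficulty` names are hypothetical, and the 8-hour boundary between "hours" and "days" is an assumption, since the slide gives only the three labels.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    EASY = "easy"      # under 1 hour of hands-on time
    MEDIUM = "medium"  # hours
    HARD = "hard"      # days


def difficulty_from_hours(hands_on_hours: float) -> Difficulty:
    """Bucket by human hands-on time, not elapsed time.

    The 8-hour cut-off for HARD is an assumption for illustration.
    """
    if hands_on_hours < 1:
        return Difficulty.EASY
    if hands_on_hours < 8:
        return Difficulty.MEDIUM
    return Difficulty.HARD


@dataclass(frozen=True)
class ToilEntry:
    work_type: str         # e.g. "quota change", "ACL update"
    hands_on_hours: float  # human hands-on time only
    done_by: str           # who did the work

    @property
    def difficulty(self) -> Difficulty:
        return difficulty_from_hours(self.hands_on_hours)


entry = ToilEntry(work_type="quota change", hands_on_hours=0.5, done_by="alice")
print(entry.work_type, entry.difficulty.value, entry.done_by)
```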
Identifying toil: Survey the team
• Averaging over the past four weeks, approximately what fraction of your time did you spend on toil?
  • Scale 0-100%
• How happy are you with the quantity of time you spend on toil?
  • Not happy / OK / No problem at all
• What are your top three sources of toil?
  • On-call Response / Interrupts / Pushes / Capacity / Other / etc.
• Do you have a long-term engineering project in your quarterly objectives?
  • Yes / No
• If so, averaging over the past four weeks, approximately what fraction of your time did you spend on your engineering project? (estimate)
  • Scale 0-100%
• In your team, is there toil you can automate away but you don't do so, because that very toil takes time away from long-term engineering work? If so, please describe below.
  • Open response
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
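Once the survey answers are collected, they can be summarized per team. The sketch below assumes a hypothetical `SurveyResponse` record that holds only the toil fraction and the top sources; the sample answers are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SurveyResponse:
    toil_percent: float           # 0-100, averaged over the past four weeks
    top_sources: tuple[str, ...]  # up to three sources of toil


def summarize(responses: list[SurveyResponse], top_n: int = 2):
    """Return the team's average toil percentage and most-cited sources."""
    average = mean(r.toil_percent for r in responses)
    counts = Counter(src for r in responses for src in r.top_sources)
    return average, counts.most_common(top_n)


# Invented sample data, not real survey results.
responses = [
    SurveyResponse(60, ("On-call Response", "Interrupts", "Pushes")),
    SurveyResponse(40, ("Interrupts", "Capacity")),
    SurveyResponse(50, ("Interrupts", "On-call Response")),
]

average, top_sources = summarize(responses)
print(average, top_sources)
```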
Measuring Toil
• Regularly compute an estimate of how much time is being spent on various types of work
• Look for patterns or trends in your tickets, surveys, and on-call incident response, and prioritize based on the aggregate human time spent
https://www.rundeck.com/blog/sre-anti-pattern-known-workaround-bug-closed
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
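Prioritizing by aggregate human time can be as simple as summing hands-on hours per work type and sorting. A minimal sketch, assuming a log of `(work type, hours)` pairs; the entries are invented for illustration.

```python
from collections import defaultdict

# (work type, human hands-on hours), e.g. gathered from tickets and
# on-call records. Sample values are invented.
log = [
    ("quota change", 0.5),
    ("push release", 3.0),
    ("quota change", 0.5),
    ("ACL update", 1.5),
    ("push release", 2.5),
]


def hours_by_type(entries):
    """Sum hands-on hours per work type, largest total first."""
    totals = defaultdict(float)
    for work_type, hours in entries:
        totals[work_type] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = hours_by_type(log)
for work_type, hours in ranking:
    print(f"{work_type}: {hours:.1f} h")
```

The work type at the top of the ranking is the first candidate for automation.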
Eliminating Toil
• Treat your automation like any other production system
• If you have an SLO practice, use some of your error budget to automate away toil
• Complete postmortems when your automation fails, and fix it as you would any user-facing system
• You want your automation available to you in any situation, including production incidents, to free humans to do the work they're good at
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
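To make the error-budget point concrete, the arithmetic for an availability SLO can be sketched as follows. The 99.9% target, the 30-day window, and the 10 minutes of downtime are assumed example values, not figures from the slides.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed unavailability, in minutes, for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)              # about 43.2 minutes
remaining = remaining_budget_minutes(0.999, 10.0)  # about 33.2 minutes
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
```

Whatever budget remains is the room a team has for riskier changes, such as rolling out new automation.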
SRE Demystified - 05 - Toil Elimination