A quick summary of
SRE – Site Reliability Engineering
Yogesh shah
Agenda
• What is SRE & its background
• Before going to SRE
• SRE and DevOps
• Components of SRE
• Reliability
• SLA
• SLO
• SLI
• Error budget
• Toil
• Things we did not cover
• References
What is SRE, History & Background
SRE = Site Reliability Engineering
The term SRE originated at Google more than a decade ago, and it has been the backbone of Google's highly reliable and valuable suite of products and services.
Google did not make the details of SRE public at first, as it considered SRE the secret sauce of its success.
When the DevOps movement started, Google could see that there was a lot of interest in implementing DevOps, but there was no clear path and people were struggling to implement it.
Scrum, SAFe, Lean, DevOps ... now SRE
• Framework direction: Dev → Ops
• Flexibility: Rigid → open for interpretation
• Ease of implementation: Easy → very hard
• Fit for market demand: Less → high
Software delivery mechanisms: what is at the center, type, advantages, difficulties
Waterfall / Project management
• Centers around: Plan
• Outcome: Fixed target
• Type: Process
• Advantages: Easy to implement; scope, time, and cost fixed
• Difficulties: Changing requirements; too heavy, complex & costly
ITIL
• Centers around: SLA
• Outcome: Predefined service quality
• Type: Framework
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy & complex
Scrum / SAFe
• Centers around: Timebox, focus
• Outcome: Delivery of changing requirements
• Type: Framework
• Advantages: Simple to understand
• Difficulties: Difficult to implement; works best in pockets, but consistency is hard to achieve
Lean
• Centers around: Flow of work
• Outcome: Removal of waste
• Type: Methodology
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy & complex
DevOps
• Centers around: Unifying Dev & Ops
• Outcome: End-to-end accountability for Dev & Ops
• Type: Philosophy
• Advantages: Great vision
• Difficulties: Open to interpretation
What is SRE in comparison to the others
• Centers around: Reliability
• Outcome: Customer satisfaction, with control over the balance between enhancement and reliability
• Type: Implementation pattern
• Advantage: Implements DevOps
• Disadvantage: None
• Addresses a so-far neglected question: "Is the system ready to handle change without impacting customer experience?"
• SRE happens when a software engineer is tasked with what used to be called operations.
SRE and DevOps
But what is DevOps?
DevOps is about a combined team (Dev & Ops) using a common set of tools and processes to deliver any software change.
SRE is an implementation of DevOps.
DevOps
• Reduce organizational silos
• Accept failure as normal
• Implement gradual change
• Leverage tooling & automation
• Measure everything
SRE (each item implements the DevOps principle in the same position above)
• Shares ownership with developers by using the same tools and techniques across the stack
• Has a formula for balancing accidents and failures against new releases
• Encourages moving quickly by reducing the cost of failure
• Encourages "automating this year's job away" and minimizing manual systems work, to focus on efforts that bring long-term value to the system
• Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
Components of SRE
Reliability
SLA
SLO
SLI
Error Budget
Toil
Defining Reliability
The most important feature of any system is its reliability
• A clunky system with great features doesn't work
• 100% reliability is most often the wrong target, as it slows down velocity
• Reliability beyond a certain point has diminishing returns
• Each additional 9 (for example, 99.9% to 99.99%) makes the system 10 times more reliable, but it costs 10 times more
User experience decides reliability
• Users, not monitoring metrics, decide reliability; hence, to say that a system is reliable, one needs to measure user experience
Engineering and talented developers alone are not enough for highly reliable systems; a well-trained incident response team is a must
• To achieve highly reliable (99.999...) systems, a well-trained incident response team (proactive & reactive) is required. Talented developers and a well-engineered system alone are not enough
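The effect of each extra 9 can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `allowed_downtime_minutes` and the 28-day window are assumptions chosen for this example, not part of any SRE tooling.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 28) -> float:
    """Downtime (in minutes) permitted in the window for a given number of nines.

    nines=2 means 99%, nines=3 means 99.9%, and so on (hypothetical convention).
    """
    unavailability = 10.0 ** (-nines)          # e.g. 3 nines -> 0.001
    window_minutes = window_days * 24 * 60     # 28 days -> 40320 minutes
    return window_minutes * unavailability


if __name__ == "__main__":
    for nines in range(2, 6):
        target_percent = 100.0 * (1.0 - 10.0 ** (-nines))
        minutes = allowed_downtime_minutes(nines)
        print(f"{target_percent:.3f}% target: {minutes:.3f} minutes of downtime allowed")
```

Each step up the list shrinks the allowed downtime by a factor of ten, which is why the engineering effort to get there grows so quickly.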
Reliability
• SRE helps define reliability in a clear way using the concept of an error budget
• Thanks to the error budget, reliability is understood consistently across the organization
• 100% reliability is the wrong target, as it slows down velocity
• User happiness is directly proportional to reliability up to a point; beyond that, the user doesn't care
SLA – Service Level Agreement
• SLAs are the agreements you make with your customers about the reliability of your service. An SLA has to have consequences if it is violated
• Violating an SLA is costly in many respects; hence an informative warning, with enough time to react, is a must to prevent SLA violations
SLO – Service Level Objectives
• Reliability is a feature, so it is prioritized against other functional features. Prioritizing reliability is challenging, however, and SLOs are key to prioritizing it alongside other features
• The SLO is the target for a specified level of reliability. In other words, the SLO is the yardstick against which reliability is judged
• An SLO should always be stricter than the corresponding SLA, because customers are usually impacted before the SLA is actually breached
• An SLO is effectively an internal promise to meet customer expectations. Violating an SLO is a really important issue, because you can no longer afford more outages; you will want to take steps to remove risk from your service by devoting engineering and automation effort to reducing and eliminating areas of risk
• A good rule of thumb for setting SLO targets is the "happiness test": the threshold beyond which users tend to become grumpy due to degraded service performance
• Identifying SLOs and selecting their targets is an important but tough task, and SRE has clear guidelines to identify SLOs, set targets, and revise SLOs, targets, or both
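The relationship between an SLO and an SLA can be sketched as a simple check. This is a minimal illustration, assuming availability is expressed as a fraction; the class `ServiceTargets`, the function `status`, and the numbers 0.995 and 0.999 are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTargets:
    sla: float  # externally promised availability (hypothetical value, e.g. 0.995)
    slo: float  # stricter internal target (hypothetical value, e.g. 0.999)

    def __post_init__(self) -> None:
        # The SLO must be stricter than the SLA so that a warning
        # arrives before the contractual promise is broken.
        if not self.slo > self.sla:
            raise ValueError("SLO should be stricter than the SLA")


def status(measured: float, targets: ServiceTargets) -> str:
    """Classify a measured availability against the SLO and the SLA."""
    if measured >= targets.slo:
        return "ok"
    if measured >= targets.sla:
        return "slo_violated"   # internal promise missed, SLA still intact
    return "sla_breached"       # contractual consequences apply


if __name__ == "__main__":
    targets = ServiceTargets(sla=0.995, slo=0.999)
    for measured in (0.9995, 0.997, 0.990):
        print(measured, status(measured, targets))
```

The band between the two targets is the early-warning zone: the SLO has been missed, but there is still time to react before the SLA is breached.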
SLI – Service Level Indicators
What is an SLI
• Now we understand what reliability is, but how do we measure it?
• The reliability of a service should be a quantitative measure of customer experience. SRE helps you find a suitable metric based on the characteristics of your service
• The metric chosen to measure the level of service provided to the user is called an SLI. In simple words, it is a quantitative measure of user experience
• How an SLI is measured depends on the implementation of the service and the environment where it operates
Relationship between SLI & SLO
• The SLI is how the service is performing against the target at a given point in time
• The SLO is the target we choose, against which the SLI is measured over a period of time (e.g. 99% of requests are served within 2 seconds over the last 4 weeks)
• The SLI tells us whether a given period was good or bad, by comparing the measured SLI against the SLO target
• SLOs can differ by time period, customer type, frequency of SLO misses, etc.; the concept of an error budget helps you manage this
How SRE helps
• SRE provides an SLI menu for typical user journeys (system characteristics)
• SRE provides a simple formula to measure SLIs. It is always a ratio (good events / valid events)
• SRE provides blueprints for implementing an SLI capture mechanism, along with the tradeoffs
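The "good events / valid events" ratio can be sketched for the latency example given above (99% of requests served within 2 seconds). This is a minimal illustration under stated assumptions: the function names are hypothetical, latencies are in seconds, and `None` is used here to mark events that are not valid and are excluded from the ratio.

```python
from typing import Iterable, Optional


def latency_sli(latencies_s: Iterable[Optional[float]], threshold_s: float = 2.0) -> float:
    """Return good events / valid events for a latency SLI.

    A valid event is any request with a recorded latency (None is skipped);
    a good event is a valid request served within threshold_s seconds.
    """
    valid = [x for x in latencies_s if x is not None]
    if not valid:
        raise ValueError("no valid events in the measurement window")
    good = sum(1 for x in valid if x <= threshold_s)
    return good / len(valid)


def meets_slo(sli: float, slo_target: float = 0.99) -> bool:
    """Compare the measured SLI with the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    window = [0.5] * 99 + [3.0]          # 99 fast requests, 1 slow request
    sli = latency_sli(window)
    print(f"SLI = {sli:.2%}, SLO met: {meets_slo(sli)}")
```

Deciding which events count as valid is part of the SLI definition itself, so the exclusion rule used here is only one possible choice.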
Error Budget
• Identifying, documenting, and agreeing on SLOs and SLIs is great progress, but how can we make all this work?
• The error budget is useful to:
  • Actively balance the reliability of the system against progress on other features in a coherent manner
  • Inform everyone how much headroom is available before customer experience is impacted
  • State quantitatively how much failure or unreliability is allowed
• E.g.
  • If the intended reliability is 99.9%, the error budget is 0.1%
  • 0.1% error budget = 40.32 minutes of downtime over 28 days
  • These 40.32 minutes are the error budget implied by the SLO that we agree with all stakeholders. That means we have 40.32 minutes in total to absorb and recover from all failures in the window. A failure can have any cause: HDD failure, bad code, maintenance error, etc.
• It prompts a lot of useful thinking
  • Assume that the reliability target for your platform is 95% over 28 days. That means you are allowed 1.4 days of downtime. Now, do you really need CI/CD, blue-green deployment, test automation, etc.?
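The error-budget arithmetic in the example above can be reproduced with a short sketch. The function names `error_budget_minutes` and `budget_remaining_minutes` are hypothetical, and downtime is treated as the only way to spend the budget, which is a simplifying assumption.

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Total minutes of unreliability allowed by the SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_minutes(slo: float, downtime_minutes: float, window_days: int = 28) -> float:
    """Budget left after the downtime already incurred (negative means overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    print(f"99.9% over 28 days: {error_budget_minutes(0.999):.2f} minutes")
    print(f"95% over 28 days: {error_budget_minutes(0.95) / (24 * 60):.2f} days")
    # A hypothetical 25-minute outage against the 99.9% target:
    print(f"Remaining after a 25-minute outage: {budget_remaining_minutes(0.999, 25.0):.2f} minutes")
```

A remaining budget near zero is the signal to slow down feature releases and prioritize reliability work; a large remaining budget suggests there is room to ship faster.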
Toil
• Toil is work related to running a production system or service
• Toil satisfies the following conditions:
  • Manual
  • Repetitive
  • Automatable
  • Tactical
  • Devoid of long-term value
• Overhead (attending meetings, responding to email, etc.) is not toil
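The conditions above can be expressed as a small checklist. This sketch assumes one reading of the slide, namely that a task must satisfy all five conditions to count as toil; the `Task` class, its field names, and the sample tasks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_long_term_value: bool
    overhead: bool = False  # meetings, email, etc. are overhead, not toil


def is_toil(task: Task) -> bool:
    """Assumption: a task is toil only if it meets all five conditions and is not overhead."""
    if task.overhead:
        return False
    return all((task.manual, task.repetitive, task.automatable,
                task.tactical, task.no_long_term_value))


if __name__ == "__main__":
    restart = Task("restart stuck worker by hand", True, True, True, True, True)
    meeting = Task("weekly status meeting", True, True, False, False, True, overhead=True)
    for task in (restart, meeting):
        print(task.name, "->", "toil" if is_toil(task) else "not toil")
```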
Not covered
• Detailed steps and workshops for developing SLOs and SLIs
• Setting achievable SLO targets
• Define SLIs
• Manage growth of SLI parameter
• SLI menu, implementation patterns, tradeoffs and cost analysis
• Define and analyze error budget
• Error budget policy, thresholds and scenarios
• Identify and address SLO risks
• Consequences of missing SLO
• There is much more
References
• SRE Introduction – Set of videos about SRE introduction
• SRE – How Google Runs Production Systems
• SRE Workbook – Practical ways to implement SRE
Thank you
