SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Downloaden Sie, um offline zu lesen
Site Reliability engineering /
Emergency Procedures
Ashutosh Agarwal
Context: Launching new features vs reliability
● Product teams want to move fast
○ Deploy as many features as possible
○ Deploy as fast as possible
● Reliability teams want to be stable
○ Focus on reliability
○ Break as little as possible
Goal of SREs is to enable the product teams to move as fast as practically possible
without impacting users.
This talk
Users are having issues with your service
● What does “having issues with your service” mean ?
● How do we know if something went wrong ?
● What went wrong ?
● What action to take ?
● How to prevent these from occurring in future ?
Availability
Definition varies
● % Uptime
● % Requests served
Path towards high availability
● Add redundancy (N+2)
○ More replicas - webservers, database
○ Load Balancers
● Write more tests : integration tests
● More humans to respond to pagers
● Maintain checklist of possible bugs, and run through the list for each release
● Reduce release velocity
○ Release one feature at a time
○ Release a binary only after exhaustive automated & human testing
Quiz:
I want my
service
availability to be
… ?
❏ 97%
❏ 99.9%
❏ 99.999%
❏ 100%
Right answer
It depends. Trade-off between launch velocity
and system stability.
● A feature with low availability would be
unusable - always broken
● A feature with very high availability would
be unusable - very boring, features missing
Deciding on right availability target
● Infrastructure (shared service) vs user-facing service
● Mature Product vs New Product
● Paid vs Free Product
● High Revenue impact vs Low
Basically, it’s a very logical cost-benefit analysis.
Service Level Indicators (SLI),
Service Level Objectives (SLO)
Service Level Indicators
What are the metrics that matter ?
Latency
Throughput : Number of requests processed per second
Error Rate : % of requests answered
It is important that these metrics are clearly
thought-about and mutually agreed upon.
Service Level Objectives (SLOs)
What are the target values for your SLIs
Latency - 300ms.
Often, it is hard to define SLOs for some metrics.
In those cases:
- It is important to have a hard upper limit
- Enables abnormal behavior detection
- Gives a guiding benchmark to teams
Expected downtime
99% = 4 days of downtime a year
99.9% = 9 hours of downtime
99.99% = 1 hour of downtime
99.999% = 5 min of downtime
Error Budget
Ex: Target Availability 99.9%
Practically, the service is available 99.99% time.
The service has gained some error budget to launch new features faster now !
Likewise, vice-versa.
Reliable only up to the defined limit, no more
● Sometimes, systems may give false impression of being over-reliable
● Leads other teams to build with false assumptions
Extreme measure : Planned Downtime
Abnormality Detection
It would have been easy if …
Database
Server
Web
Server
A simple web application
But most often in reality
Load
Balancer
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service n
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service 1
Merge
Node
Web
Server
Web
Server
High-level architecture of a web-application
Monitoring
Without monitoring ...
● No way to know if my system is working, it is like flying in the dark
● Predicting a future issue, is not at all possible
Good monitoring
● Enables analysis of long-term trends
● Enables comparative analysis
○ 1 Machine with 32GB RAM vs 2 machines with 16 GM RAM each
● Enables conducting ad hoc analysis - latency shot up, what else changed ?
Good monitoring
● Never requires a human to interpret an alert
○ Humans add latency
○ Alert fatigue
○ Ignore, if alarms are usually false
● 3 kinds of monitoring output
○ Alert : True alarm - red alert. Human needs to take action.
○ Ticket : Alarm - yellow alert. Human needs to take action eventually
○ Logging : Not a alarm - no action needed.
Good Alerts
● Alert is a sign that something is
actually broken
Bad alert : Average latency is 125ms.
Good alert: Average latency is
breaking latency SLOs
● Alerts MUST be actionable.
● Alert for symptoms, not causes
What = Symptom, Why = (Possible)
Cause
“What” versus “Why” is an important
distinction between good monitoring
with maximum signal and minimal
noise.
Black box monitoring
● Mimics a system user
● Shows how your users view your
system
Types of monitoring
White box monitoring
● Viewing your system from the inside,
knowing very well how it functions
● For e.g each service keeps counters of
all backend service failures, latencies
Black-box monitoring example
Request
GET /mysite.html HTTP/1.1
Host: www.example.com
Response
HTTP/1.1 200 OK
Date: 01 Jan 1900
Content Type: text/html
<html> … </html>
White-box monitoring
Load
Balancer
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service n
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service 1
Merge
Node
Web
Server
Web
Server
qps: 100
peak: 500
num_shards: 9
backend_latency:
200ms
mem_size: 123 mb
peak: 1029 mb
backend_latency:
20ms
Life of an incident
10.59 PM
Error
404
11.00 PM
Error
404
Error
404
Error
404
Error
404
Error
404
10 users get 404 error in 1 minute.
11.00 PM
Error
404
Error
404
Error
404
Error
404
Error
404
10 users get 404 error in 1 minute.
Monitoring
system kicks in
Sends a
pager to
oncall
engineer /
SRE
In this case, black box monitoring kicked off an alert.
This is after the fact alert.
Sometimes, white-box monitoring can help predict an
issue is about to happen.
11.05 PM
Playbook
(s)
Verifies the
user impact
Finds out the
underlying
broken service
causing the
issue
Consults the
playbook of
underlying
service
Finds how to
disable a
particular
feature
White-box monitorning invariably plays a big role in figuring
out the origin of the problem.
Also, a good central logging system used by ALL servers
throughout the system.
11.10 PM
Load
Balancer
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service n
Database
Server
Database
Server
Master
/
Router
Cache
Layer
Service 1
Merge
Node
Web
Server
Web
Server
Issue
fixed
In summary
● Playbooks: guide to turning off features, feature flag names & descriptions
● Mean time to failure (MTTF)
● Mean time to recovery (MTTR)
Reliability = function (MTTF, MTTR)
● Statistically, having playbooks improves MTTR by 3x.
Culture of SREs
John is a hero.
John knows about the system more than
others.
John can fix this issue in under a minute.
John doesn’t need to read docs to fix it.
But there is only ONE John.
And John needs to sleep, eat …
And John can solve only ONE issue at a
time.
Avoid a culture of
heros.
Preventing future hiccups
Release management : Progressive rollouts
● Canary servers
○ A few production instances serving traffic from new binary
● Canary for sufficient time
○ Give the new binary some time to experience all possible use cases
● Canary becomes prod on green
● Changes be backward compatible with 2-3 releases
○ Binaries can be safely rolled back
No-blame postmortem
● Teams write detailed postmortem after resolving outages
● Not attributed to any single person / team
● Postmortem is structured as:
○ What went wrong
○ What was the immediate fix
○ What is the long term fix
○ What can we do to prevent it occurring in the future ?
Catastrophe Testing (Artificial)
Dedicated days to stress test the system
● Bring down a few machines
● Lose network connectivity
In summary
● Judiciously decide on what metrics to measure, and target values
● Monitor both user facing as well as internal metrics
● Alert only when necessary
● Plan for failure management - dashboards, playbooks and uniform logging
● Avoid a culture of heros
● Learn from failures
Questions ?

Weitere ähnliche Inhalte

Was ist angesagt?

Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaKeet Sugathadasa
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationDr Ganesh Iyer
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)Setyo Legowo
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkMichae Blakeney
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)Hussain Mansoor
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...DevClub_lv
 
Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIKnoldus Inc.
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRESquadcast Inc
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsRauno De Pasquale
 
When down is not good enough. SRE On Azure - PolarConf
When down is not good enough. SRE On Azure - PolarConfWhen down is not good enough. SRE On Azure - PolarConf
When down is not good enough. SRE On Azure - PolarConfRene Van Osnabrugge
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7NUS-ISS
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...DevOpsDays Tel Aviv
 
What's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE ParisWhat's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE ParisClément Michaud
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectivelyAshutosh Agarwal
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)jeetendra mandal
 

Was ist angesagt? (20)

Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
Sre summary
Sre summarySre summary
Sre summary
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
 
Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLI
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
SRE 101
SRE 101SRE 101
SRE 101
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
 
When down is not good enough. SRE On Azure - PolarConf
When down is not good enough. SRE On Azure - PolarConfWhen down is not good enough. SRE On Azure - PolarConf
When down is not good enough. SRE On Azure - PolarConf
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
 
What's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE ParisWhat's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE Paris
 
Managing software projects & teams effectively
Managing software projects & teams effectivelyManaging software projects & teams effectively
Managing software projects & teams effectively
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
 

Ähnlich wie Overview of Site Reliability Engineering (SRE) & best practices

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafkaconfluent
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUGslandelle
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Sistemas Distribuidos
Sistemas DistribuidosSistemas Distribuidos
Sistemas DistribuidosLocaweb
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)ggarber
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call Raygun
 
Monitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafaMonitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafaLama K Banna
 
Ensuring Your Technology Will Scale
Ensuring Your Technology Will ScaleEnsuring Your Technology Will Scale
Ensuring Your Technology Will Scalebasissetventures
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueScyllaDB
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithNETWAYS
 
Benchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersBenchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersJustin Dorfman
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...james tong
 
Nfr testing(performance)
Nfr testing(performance)Nfr testing(performance)
Nfr testing(performance)Dilip Sharma
 

Ähnlich wie Overview of Site Reliability Engineering (SRE) & best practices (20)

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUG
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Sistemas Distribuidos
Sistemas DistribuidosSistemas Distribuidos
Sistemas Distribuidos
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
 
HowTo DR
HowTo DRHowTo DR
HowTo DR
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call The on-call survival guide - how to be confident on-call
The on-call survival guide - how to be confident on-call
 
Diesel load testing tool
Diesel load testing toolDiesel load testing tool
Diesel load testing tool
 
Monitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafaMonitoring &amp; alerting presentation sabin&amp;mustafa
Monitoring &amp; alerting presentation sabin&amp;mustafa
 
Ensuring Your Technology Will Scale
Ensuring Your Technology Will ScaleEnsuring Your Technology Will Scale
Ensuring Your Technology Will Scale
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
 
Benchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersBenchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbers
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
 
sat_presentation
sat_presentationsat_presentation
sat_presentation
 
Nfr testing(performance)
Nfr testing(performance)Nfr testing(performance)
Nfr testing(performance)
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Kürzlich hochgeladen (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Overview of Site Reliability Engineering (SRE) & best practices

  • 1. Site Reliability engineering / Emergency Procedures Ashutosh Agarwal
  • 2. Context: Launching new features vs reliability ● Product teams want to move fast ○ Deploy as many features as possible ○ Deploy as fast as possible ● Reliability teams want to be stable ○ Focus on reliability ○ Break as little as possible Goal of SREs is to enable the product teams to move as fast as practically possible without impacting users.
  • 3. This talk Users are having issues with your service ● What does “having issues with your service” mean ? ● How do we know if something went wrong ? ● What went wrong ? ● What action to take ? ● How to prevent these from occurring in future ?
  • 4. Availability Definition varies ● % Uptime ● % Requests served
  • 5. Path towards high availability ● Add redundancy (N+2) ○ More replicas - webservers, database ○ Load Balancers ● Write more tests : integration tests ● More humans to respond to pagers ● Maintain checklist of possible bugs, and run through the list for each release ● Reduce release velocity ○ Release one feature at a time ○ Release a binary only after exhaustive automated & human testing
  • 6. Quiz: I want my service availability to be … ? ❏ 97% ❏ 99.9% ❏ 99.999% ❏ 100%
  • 7. Right answer It depends. Trade-off between launch velocity and system stability. ● A feature with low availability would be unusable - always broken ● A feature with very high availability would be unusable - very boring, features missing
  • 8. Deciding on right availability target ● Infrastructure (shared service) vs user-facing service ● Mature Product vs New Product ● Paid vs Free Product ● High Revenue impact vs Low Basically, it’s a very logical cost-benefit analysis.
  • 9. Service Level Indicators (SLI), Service Level Objectives (SLO)
  • 10. Service Level Indicators What are the metrics that matter ? Latency Throughput : Number of requests processed per second Error Rate : % of requests answered It is important that these metrics are clearly thought-about and mutually agreed upon.
  • 11. Service Level Objectives (SLOs) What are the target values for your SLIs Latency - 300ms. Often, it is hard to define SLOs for some metrics. In those cases: - It is important to have a hard upper limit - Enables abnormal behavior detection - Gives a guiding benchmark to teams
  • 12. Expected downtime 99% = 4 days of downtime a year 99.9% = 9 hours of downtime 99.99% = 1 hour of downtime 99.999% = 5 min of downtime
  • 13. Error Budget Ex: Target Availability 99.9% Practically, the service is available 99.99% time. The service has gained some error budget to launch new features faster now ! Likewise, vice-versa.
  • 14. Reliable only up to the defined limit, no more ● Sometimes, systems may give false impression of being over-reliable ● Leads other teams to build with false assumptions Extreme measure : Planned Downtime
  • 15.
  • 17. It would have been easy if … Database Server Web Server A simple web application
  • 18. But most often in reality Load Balancer Database Server Database Server Master / Router Cache Layer Service n Database Server Database Server Master / Router Cache Layer Service 1 Merge Node Web Server Web Server High-level architecture of a web-application
  • 19. Monitoring Without monitoring ... ● No way to know if my system is working, it is like flying in the dark ● Predicting a future issue, is not at all possible Good monitoring ● Enables analysis of long-term trends ● Enables comparative analysis ○ 1 Machine with 32GB RAM vs 2 machines with 16 GM RAM each ● Enables conducting ad hoc analysis - latency shot up, what else changed ?
  • 20. Good monitoring ● Never requires a human to interpret an alert ○ Humans add latency ○ Alert fatigue ○ Ignore, if alarms are usually false ● 3 kinds of monitoring output ○ Alert : True alarm - red alert. Human needs to take action. ○ Ticket : Alarm - yellow alert. Human needs to take action eventually ○ Logging : Not a alarm - no action needed.
  • 21. Good Alerts ● Alert is a sign that something is actually broken Bad alert : Average latency is 125ms. Good alert: Average latency is breaking latency SLOs ● Alerts MUST be actionable. ● Alert for symptoms, not causes What = Symptom, Why = (Possible) Cause “What” versus “Why” is an important distinction between good monitoring with maximum signal and minimal noise.
  • 22. Black box monitoring ● Mimics a system user ● Shows how your users view your system Types of monitoring White box monitoring ● Viewing your system from the inside, knowing very well how it functions ● For e.g each service keeps counters of all backend service failures, latencies
  • 23. Black-box monitoring example Request GET /mysite.html HTTP/1.1 Host: www.example.com Response HTTP/1.1 200 OK Date: 01 Jan 1900 Content Type: text/html <html> … </html>
  • 24. White-box monitoring Load Balancer Database Server Database Server Master / Router Cache Layer Service n Database Server Database Server Master / Router Cache Layer Service 1 Merge Node Web Server Web Server qps: 100 peak: 500 num_shards: 9 backend_latency: 200ms mem_size: 123 mb peak: 1029 mb backend_latency: 20ms
  • 25. Life of an incident
  • 28. 11.00 PM Error 404 Error 404 Error 404 Error 404 Error 404 10 users get 404 error in 1 minute. Monitoring system kicks in Sends a pager to oncall engineer / SRE In this case, black box monitoring kicked off an alert. This is after the fact alert. Sometimes, white-box monitoring can help predict an issue is about to happen.
  • 29. 11.05 PM Playbook (s) Verifies the user impact Finds out the underlying broken service causing the issue Consults the playbook of underlying service Finds how to disable a particular feature White-box monitorning invariably plays a big role in figuring out the origin of the problem. Also, a good central logging system used by ALL servers throughout the system.
  • 31. In summary ● Playbooks: guide to turning off features, feature flag names & descriptions ● Mean time to failure (MTTF) ● Mean time to recovery (MTTR) Reliability = function (MTTF, MTTR) ● Statistically, having playbooks improves MTTR by 3x.
  • 33. John is a hero. John knows about the system more than others. John can fix this issue in under a minute. John doesn’t need to read docs to fix it. But there is only ONE John. And John needs to sleep, eat … And John can solve only ONE issue at a time. Avoid a culture of heros.
  • 35. Release management : Progressive rollouts ● Canary servers ○ A few production instances serving traffic from new binary ● Canary for sufficient time ○ Give the new binary some time to experience all possible use cases ● Canary becomes prod on green ● Changes be backward compatible with 2-3 releases ○ Binaries can be safely rolled back
  • 36. No-blame postmortem ● Teams write detailed postmortem after resolving outages ● Not attributed to any single person / team ● Postmortem is structured as: ○ What went wrong ○ What was the immediate fix ○ What is the long term fix ○ What can we do to prevent it occurring in the future ?
  • 37. Catastrophe Testing (Artificial) Dedicated days to stress test the system ● Bring down a few machines ● Lose network connectivity
  • 38. In summary ● Judiciously decide on what metrics to measure, and target values ● Monitor both user facing as well as internal metrics ● Alert only when necessary ● Plan for failure management - dashboards, playbooks and uniform logging ● Avoid a culture of heros ● Learn from failures