"Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production"
This is an introductory guide that shares common SRE concepts with a non-technical audience. We will look at both the technical and organizational changes that should be adopted to increase operational efficiency and drive global optimizations, such as minimizing downtime and improving systems architecture & infrastructure:
- Improving incident response
- Defining error budgets
- Better monitoring of systems
- Getting the best out of systems alerting
- Eliminating manual, repetitive actions (toil) through automation
- Designing better on-call shifts/rotations
- Designing the role of the Site Reliability Engineer (who effectively works between application development teams and operations support teams)
1. Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability in production
Abeer Rahman
Toronto Agile Conference
Oct 2018
@abeer486
8. The goal is not just for software to be pushed out, but to also be
maintained once it’s live
Sanity Check
How long does it take for an incident to reach the right team?
Can it be minutes, instead of hours?
How often does the same incident recur?
Can that be reduced to never?
9. Reliability is the ability of a system to withstand certain
types of failure and still remain functional
from a user's perspective.
10. …is what happens when a software engineer is tasked with solving
operations problems
Site Reliability Engineering…
Site reliability engineering (SRE) is a specialized skillset where software
engineers develop and contribute to the ongoing operation of systems in
production.
Site Reliability Engineer:
• Spends 50% of contributions on systems engineering problems (infrastructure, operating systems & networks, databases): the availability, observability, efficiency & reliability of services
• Spends 50% of contributions on software engineering problems (application design, data structures, service design & key integrations): development tasks that contribute to systems reliability
11. SRE team objectives
Site Reliability Engineers spend less than 50% of their time performing operational work to
allow them to spend more time on improving infrastructure and task automation
SRE approach to Operations:
• Treat operations like a software engineering problem
• Automate tasks normally done by system administrators (such as
deployments)
• Design more reliable and operable service architectures from the
ground up, not as an afterthought
Support systems in production / Contribute to project work
12. The evolution of systems requires an evolution of systems
engineers
Why another role/discipline?
Common issues:
• Reliability is usually left to architects to design, typically at the beginning
of product design
• Non-functional requirements are not reviewed as often
• A change in functionality may impact previously assumed reliability
requirements
• Most outages are caused by change, and a traditional Ops mindset does
not necessarily prepare teams for a fast remedy
In the new world, think of systems as software-defined.
14. DevOps vs. SRE?
DevOps:
A set of principles & culture guidelines
that help to break down the silos
between development and operations
/ networking / security
• Reduce silos
• Plan for failure
• Small batch changes
• Tooling & automation
• Measure everything
Site Reliability Engineering:
A set of practices, with an emphasis on
strong engineering capabilities, that
implements the DevOps principles and
defines a job role + team
• Reduce silos through shared ownership
• Plan for failure using error budgets
• Make small batch changes with a focus
on stability
• Tooling & automation of manual
tasks
• Measure everything by monitoring the
right things
SRE implements DevOps
15. SREs aim to create more value by linking Operations activities with
development
This pattern is used by many
organizations that have a high
degree of organizational &
engineering maturity.
• SREs often act as gatekeepers
for production readiness
• Sub-standard software is
rejected, and SREs provide
support in fortifying against
potential operational issues
https://web.devopstopologies.com/
16. Shared Responsibility Model
Application development teams take part in supporting applications in
production
• SRE work can flow into application development teams
• Roughly 5% of operational work goes to developers
• Developers take part in on-call rotations (common these days)
• Hand the pager to the developers once the reliability objectives are met
18. Enabling successful SRE teams requires buy-in across the entire
organization
Rollout of SRE in an Organization
The values of the company must align with how SRE operates
• Organizational mandate
• Leadership buy-in
• Shared Responsibilities
19. Horizontal enablement of SRE across the organization
Model 1: Centralized SRE Teams as a shared service
(Diagram: a centralized SRE team alongside Platform Operations, serving multiple product teams and Product Mgmt.)
A centralized SRE team can act as a
shared service for the whole
organization:
• Common backlog of the organization's
reliability improvements, so they are rolled
out evenly
• SREs are brought in as part of
design, reviews, validation, and go-lives
• Limited level of influence within individual teams
Sample organizations:
• Facebook (via Production
Engineering)
20. Placing SREs in vertical slices across the organization
Model 2: Horizontal enablement across products/teams
SREs embedded with application teams
SREs serve as full-time team
members within the product
teams:
• Treated like a member of the
product team
• Contribute regularly to the overall
product reliability design &
implementation
Sample organizations:
• Google
• Bloomberg
• Uber
Mature state: Pairing with application
team to uplift the application, then
rotate out
(Diagram: an SRE embedded within each product team, alongside Platform Operations and Product Mgmt.)
21. Using these as a baseline helps to track progress
Key metrics to measure during enablement of SRE
• Mean Time to Restore (MTTR) service
• Lead time to release (or rollback)
• Improved monitoring to catch & detect issues earlier
• Establishing error budgets to enable budget-based risk management
22. Finding key talent who have a natural curiosity for software
systems and aggressively automate
Hiring & Onboarding SREs
Hiring for skillsets:
• Ideally, multi-skilled individuals who
have experience in writing and
maintaining software
• Traditional Ops / Sys Admins who
are willing to learn scripting /
programming
• App developers who are willing to step
up for operational roles
• Exposure to and interest in systems
architecture
Onboarding SREs:
• SRE Bootcamps that consist of insight
into the system architecture, software
delivery process and operations
• On-call rotations with senior SRE
members
• Participation in incident response
calls
24. How do we go about measuring the availability of a system through the
eyes of a user?
Measuring reliability
Time-based measure:
Availability = actual system uptime / anticipated system uptime
• Easy to measure through systematic means (e.g., server uptime)
• Difficult to judge success for distributed systems
• Does not really speak for the user's experience

Interaction-based measure:
Availability = good interactions / total interactions
• Measures the number of real users for whom the service is working
• Better for measuring distributed systems
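To make the contrast concrete, here is a minimal sketch (Python, with made-up numbers; none of the figures come from the deck) computing both measures over the same 30-day window:

```python
# Hypothetical 30-day window; all numbers are illustrative.
WINDOW_MINUTES = 30 * 24 * 60          # anticipated uptime: the full window
downtime_minutes = 43                  # observed downtime in the window

# Time-based availability: actual system uptime / anticipated system uptime
time_based = (WINDOW_MINUTES - downtime_minutes) / WINDOW_MINUTES

# Interaction-based availability: good interactions / total interactions
total_interactions = 1_200_000
good_interactions = 1_198_600          # e.g., successful requests within the latency target
interaction_based = good_interactions / total_interactions

print(f"time-based:        {time_based:.4%}")         # -> 99.9005%
print(f"interaction-based: {interaction_based:.4%}")  # -> 99.8833%
```

The two numbers can diverge: a partial outage that affects only some users barely moves uptime, but shows up directly in the interaction-based measure.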
25. Taking it one step further, set up a budget that allows for taking
risks
Budget for unreliability
100% - Availability target = Budget of unreliability
or, the Error Budget
Method:
1. SRE and Product Management jointly define an availability target
2. After that, an error budget is established
3. Monitoring (via uptime) is set in place to measure performance +
metrics relative to the objective set
4. A feedback loop is created for maintaining / utilizing the budget
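As a quick sketch of the arithmetic (Python; the helper function is illustrative, not from the deck), the error budget follows directly from the availability target, and reproduces the allowed-unavailability table two slides down:

```python
def error_budget_minutes(availability_target: float, window_days: float) -> float:
    """Allowed unavailability (in minutes) for a given target and window."""
    budget_fraction = 1.0 - availability_target   # 100% - availability target
    return budget_fraction * window_days * 24 * 60

# e.g., a 99.9% availability target:
print(round(error_budget_minutes(0.999, 30), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.999, 365), 1))  # 525.6 minutes, i.e. ~8.76 hours per year
```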
26. Benefits of having Error Budgets
Shared responsibility for uptime
Functionality failures as well as infrastructure issues fall into the same error budget
Common incentive for Product teams & SREs
Creates a balance between reliability and feature innovation
Product teams can prioritize capabilities
The team now can decide where and how to spend the error budget
Reliability goals become credible & achievable
Provides a more pragmatic approach to setting a realistic availability target
Error budgets become an explicit trackable metric that glues
product feature planning to service reliability
27. Setting up a reliability goal depends on the nature of the
business
Availability Target | Allowed Unavailability per year | per 30 days
95%     | 18.25 days   | 1.5 days
99%     | 3.65 days    | 7.2 hours
99.9%   | 8.76 hours   | 43.2 minutes
99.99%  | 52.6 minutes | 4.32 minutes
99.999% | 5.26 minutes | 25.9 seconds
Question: Why is 100% reliability not a good idea?
28. Product Management with SREs define service levels for the
systems as part of product design
Approach to defining Service Levels
Service levels allow us to:
• Balance features being built with operational reliability
• Qualify and quantify the user experience
• Set user expectations for availability and performance
29. Method of defining Service Levels
Service Level Agreement (SLA)
• A business contract between the service provider and the customer
• Loss of service could be related to loss of business
• Typically, an SLA breach would mean some form of compensation
to the client
Service Level Objective (SLO)
• A threshold beyond which an improvement of the service is
required
• The point at which users may consider opening a support
ticket, the "pain threshold", e.g., YouTube buffering
• Driven by business requirements, not just current performance
Service Level Indicator (SLI)
• Quantitative measure of an attribute of the service, such as
throughput, latency
• Directly measurable & observable by users, not an internal
metric (response time vs. CPU utilization)
• Could be a way to represent the user's experience
30. Group exercise: Setting up Service Levels
Care-Van is an app that helps to find
& book van owners who are available
to give lifts to elderly people who have
mobility issues.
You are an SRE working with the
product team that designs the ride
request for the customers.
What do you suggest are good
measures/metrics of the following:
• Service Level Indicator (SLI)
• Service Level Objective (SLO)
• Service Level Agreement (SLA)
5 minutes
For Reference
SLI: a measurable attribute of a
service that represents
availability/performance of the
system
SLO: defines how the service
should perform, from the
perspective of the user
(measured via SLI)
SLA: a binding contract to
provide a customer some form of
compensation if the service did
not meet expectations
31. Sample solution
Service Level Indicator (SLI):
the latency of successful HTTP responses (HTTP 200)
Service Level Objective (SLO):
The latency of 95% of the responses must be less than 200ms
Service Level Agreement (SLA):
Customer compensated if the 95th percentile latency exceeds 300ms
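A minimal sketch of checking this SLO offline (Python; the sample latencies and the nearest-rank percentile helper are assumptions for illustration):

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical latencies (ms) of successful HTTP 200 responses:
latencies_ms = [120, 95, 180, 210, 140, 130, 170, 160, 150, 190]

SLO_MS = 200  # SLO: 95% of responses must be under 200 ms
p95 = percentile(latencies_ms, 95)
print(f"p95 = {p95} ms -> SLO {'met' if p95 < SLO_MS else 'breached'}")
# p95 = 210 ms -> SLO breached (one slow response out of ten is enough here)
```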
32. Exhausting an Error Budget
As long as there is remaining error budget (in a rolling
timeframe), new features can be deployed.
Question: what can you do if you exhaust the error budget?
Possibly:
• No new feature launches (for a period of time)
• Change the velocity of updates
• Prioritize engineering efforts to reliability
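One way to make that decision mechanical is a deploy gate on the remaining budget. A sketch under assumed numbers (the policy details are not prescribed by the deck):

```python
def remaining_budget_minutes(target: float, window_days: int,
                             downtime_so_far_min: float) -> float:
    """Error budget left in a rolling window, in minutes."""
    total_budget = (1.0 - target) * window_days * 24 * 60
    return total_budget - downtime_so_far_min

def can_deploy(downtime_so_far_min: float,
               target: float = 0.999, window_days: int = 30) -> bool:
    # Budget exhausted -> freeze feature launches and prioritize reliability work.
    return remaining_budget_minutes(target, window_days, downtime_so_far_min) > 0

print(can_deploy(downtime_so_far_min=10.0))  # True: ~33.2 min of budget left
print(can_deploy(downtime_so_far_min=50.0))  # False: budget overspent by ~6.8 min
```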
34. It is the primary means of determining and maintaining the
reliability of a system
Monitoring is a foundational capability for any organization
Mikey Dickerson's Hierarchy of Service Reliability (top to bottom):
Product Development → Capacity Planning → Test/Release Process → Post-Mortems → Incident Response → Monitoring (the foundation)
Monitoring helps to
identify what needs
urgent attention, and
aids prioritization:
• Urgency
• Trends
• Planning
• Improvements
35. Golden Signals of Monitoring
Golden Signals
Latency
Traffic
Errors
Saturation
The case for better monitoring:
• Focused more on uptime to deduce
availability vs. the user experience
• System metrics, like CPU,
memory, and storage, are good indicators,
however they may not mean anything
meaningful for the client experience
• "Noise overload"
It's better to use other metrics, such as
latency: are the response times taking
longer & longer?
It's best to define
SLIs & SLOs using
these golden signals,
as they account for the
user's perception of
the system
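As a hedged sketch, the four signals can be derived from a window of request records; the record fields and the saturation stand-in below are invented for illustration:

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def golden_signals(requests: list[Request], window_s: float, cpu_util: float) -> dict:
    """Latency, traffic, errors, and saturation for one observation window."""
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    return {
        "latency_p95_ms": latencies[math.ceil(0.95 * n) - 1] if n else None,
        "traffic_rps": n / window_s,
        "error_rate": sum(r.status >= 500 for r in requests) / n if n else 0.0,
        "saturation": cpu_util,  # stand-in: CPU, memory, or queue utilization
    }

reqs = [Request(120, 200), Request(340, 500), Request(150, 200), Request(90, 200)]
print(golden_signals(reqs, window_s=60.0, cpu_util=0.72))
# latency_p95_ms=340, traffic ~0.067 req/s, error_rate=0.25, saturation=0.72
```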
36. Outputs of a good monitoring system
In decreasing level of urgency: Alerts → Tickets → Logging
A carefully crafted
monitoring system should
be able to distinguish
urgent requests from
important ones, and only
alert when required
37. An alert should only page for urgent issues that need
immediate attention
Alerting design consideration
Common issues:
• Not all alerts are helpful
• Alerts are not maintained regularly to keep up
with product functionality being rolled out
Good practices:
• Measure the system along with all of its parts,
but alert only on problems that will cause or are
causing end-user pain. This helps to reduce
"alert fatigue"
• Alerts should be actionable (consider logging for
informational items)
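As a sketch of "alert only on user pain", here is a burn-rate style check (an assumption, not from the deck): page only when errors are consuming the error budget much faster than sustainable, and route everything else to tickets or logs:

```python
def should_page(error_rate: float, slo_target: float = 0.999,
                burn_rate_threshold: float = 10.0) -> bool:
    """Page only if errors burn the budget much faster than sustainable.

    budget_rate is the error rate that would exactly use up the budget
    over the SLO window; a 10x burn over a short window signals urgent,
    user-visible pain rather than background noise."""
    budget_rate = 1.0 - slo_target        # e.g., 0.1% of requests may fail
    burn_rate = error_rate / budget_rate
    return burn_rate >= burn_rate_threshold

print(should_page(error_rate=0.0005))  # False: 0.5x burn -> log/ticket instead
print(should_page(error_rate=0.02))    # True: 20x burn -> wake someone up
```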
39. An effective incident response process increases user trust
Incident Response
Users don't always expect
a service to be perfect.
However, they expect their
service provider to be
transparent, fast to resolve
issues, and quick to learn from
mistakes.
Incident Response helps to
create that sense of trust when
there is a crisis.
(Dickerson's Hierarchy of Service Reliability, repeated: Incident Response sits one level above Monitoring)
41. People do not typically work well in emergencies, so
having a structured incident response process is essential
Structured Incident Response
1. Have sufficient monitoring
• Service dashboards indicating throughput, load, latency, etc.
2. Good alerting & communication process
• Notify on-call member(s) and have an automated escalation process
if the primary on-call is not reachable
• Notify end-users before or as soon as there is a degraded experience
3. Playbook-style plans & tools
• An easily accessible step-by-step checklist of actions to take when an
incident occurs
• Tools available for system introspection
Great resource:
https://www.incidentresponse.com/playbooks/improper-computer-usage
42. A post-mortem conducted after an incident helps to build a
culture of self-improvement
Post-mortems
Goals of writing a post-mortem:
• All contributing causes are understood and documented
• Action items are put in place to avoid repeating the same issue again
Good habits:
• Post-mortems conducted & closed within 3-5 days after an incident has
occurred
• Document in detail the errors, logs, steps taken, and decisions made that led
to the incident/outage
• Document fixes if an SLO is breached, even if the SLA is not impacted
44. A blameless culture enables faster experimentation without fear,
because failures are part of the process of experimentation.
Conducting blameless post-mortem
“Human errors” are
system problems. It’s
better to fix systems
& processes to better
support people in
making good choices.
If a culture of
finger-pointing prevails,
there is no incentive
for team members to
bring forth issues, for
fear of punishment.
Useful approach & questions:
• Conduct the Five Whys
• How can you provide better information to
make decisions?
• Was the information misleading? Can this
information be fixed in an automated way?
• Can the activity be automated so it requires
minimal human intervention?
• Could a new hire have made the same mistake
the same way?
John Allspaw’s reference paper on making decisions under pressure:
http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8084520&fileOId=8084521
45. GitLab's database
outage post-mortem
in January 2017
Transparency
Accountability
Blamelessness
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
46. Create a work environment that has a balance between
planned/project-driven work and interrupt-driven work
Better On-Calls
Good practices for creating a more sustainable on-call:
• Have a single person be in charge of on-call (Note: if no one is on call, then
everyone becomes on call!)
• The on-call person makes sure interrupt-based work is completed without
impacting others who are not on-call
• Create a rotation schedule to ensure relief
• No more than 2 key incidents per on-call engineer per shift (or staff
appropriately)
Project-driven work: planned work to reduce toil
Interrupt-driven work: supporting incidents
48. Technology capabilities that help with speed & reliability
Microservices
Microservices provide a
foundation to support
automation, rapid
deployment, and DevOps
infrastructure.
Cloud
Cloud architecture is a
direct response to the
need for agility.
Containers
Infrastructure manages
services and software
deployments on IT
resources where
components are
replaced rather than
changed.
49. Considerations for storage, compute, networking…
Infrastructure Design
Good practices:
• Version-control your infrastructure design and code
• Offload as much work as possible to the cloud provider
• Design for reliability by having the application + data
situated in more than one datacenter
• Design your infrastructure architecture to support
canary deployments (see the sketch below)
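To illustrate the canary point above, a minimal sketch of a promotion check (the tolerance and metric are assumptions for illustration):

```python
def promote_canary(baseline_error_rate: float, canary_error_rate: float,
                   tolerance: float = 1.5) -> bool:
    """Promote the canary only if its error rate is not markedly worse
    than the stable baseline; otherwise roll it back."""
    if baseline_error_rate == 0.0:
        return canary_error_rate == 0.0
    return canary_error_rate <= baseline_error_rate * tolerance

print(promote_canary(0.002, 0.0025))  # True: within tolerance -> promote
print(promote_canary(0.002, 0.010))   # False: 5x worse -> roll back
```

In practice, the comparison would run against live traffic split between the canary and the baseline, with latency and saturation checked alongside errors.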
50. Using the cloud as an enabler for reliability
Distribute your infrastructure across multiple datacenters (or
regions/zones)
Sample taken from Google Cloud Platform, however, most cloud providers have this capability.
51. Considerations for business logic, data layer and presentation…
Application design
Good practices:
• Re-think how your features can be re-architected to
support scalability (thereby increasing reliability)
• Favour scale out over scale up
• De-couple / modular architecture: such as Microservices
• Distribute your applications across multiple geographies
• Consistent build & deploy processes (as much as possible)
52. Design your
applications using
the principles of
Twelve-Factor App
Enables "cloud-readiness"
Helps with increasing reliability
https://12factor.net/
55. Get started with your own SRE journey!
Acknowledge: failures are inevitable, it’s best to plan for it
Buy-in: from leadership & top-management
Culture: creating a culture of self improvement
Skillset: look for multi-skilled individuals who thrive on collaboration
1. Select a sample application
2. Identify its SLI/SLO
3. Create a baseline for current state
4. Incrementally update systems & processes to meet SLO
5. Rinse, repeat
56. Organization-wide Reliability Improvement backlog
Sources of improvement backlog:
• Previous outages and incidents, especially the ones that have been recurring
• Review of tools used in operations (monitoring, alerting, on-call schedule)
• Operational technical debt – automation of manual scripts
• Mass-reboot exercises across all systems in the datacenter
• Disaster-recovery (DR) tests
57. Using these as a baseline helps to track progress…
Metrics to measure during enablement of SRE
• Mean Time to Restore (MTTR) service
• Lead time to release (or rollback)
• Improved monitoring to catch & detect issues earlier
• Establishing error budgets to enable budget-based risk management
58. …or make one that makes sense to you!
Metrics to measure during enablement of SRE
Mean time-to-return-to-bed
(MTTRTB)
59. What to watch out for when enabling SRE…
Anti-patterns!
• Rebranding your existing Operations team to SRE is not SRE!
• Treat SREs like any other software developers, not like operations
teams
• 100% is not a good SLO. It's better to plan for failure, since so many
interactions are out of your hands
• Do not do the transformation in a silo. Teams are willing to change;
it's best to involve them as part of the transformation journey, e.g.,
Change Management, Incident Management, Security