The more we are connected, and the more others connect to us, the more important the reliability of our sites becomes.
Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services, and products. But what does this mean, and how do you get started with it?
In this session I will talk about the concepts of Site Reliability Engineering and use Microsoft Azure to implement some of the concepts and practices.
You will learn:
What is Site Reliability Engineering?
How can you get started with SRE?
How to use Azure to implement some of the SRE concepts?
7. WHAT WE HEAR FROM
OPS
“We need to have a launch review!”
“Please let the CAB approve first”
“This is our change management checklist”
“We should validate first in Test and Pre-Prod”
@RENEV
8. WHAT WE HEAR FROM
DEV
“We do not launch big changes, this is just a
flag flip”
“This is a hotfix!”
“This is just a UI change... No big thing”
“Only a 20% experiment.”
@RENEV
13. SITE RELIABILITY ENGINEERING IS AN
ENGINEERING DISCIPLINE DEVOTED
TO HELPING AN ORGANIZATION
SUSTAINABLY ACHIEVE THE
APPROPRIATE LEVEL OF RELIABILITY
IN THEIR SYSTEMS, SERVICES, AND
PRODUCTS.
@RENEV
20. SLIS ARE A RATIO/PROPORTION
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10ms/# of operations
# of “full quality responses”/# of responses
# of records processed/# of records
Ratio * 100 = % proportion
@RENEV
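The ratios above translate directly into code. A minimal sketch in Python (the function name and the example numbers are illustrative, not from the talk):

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """SLI as a proportion: (good events / total events) * 100."""
    if total_events == 0:
        # No events in the window: conventionally treated as 100% reliable.
        return 100.0
    return good_events / total_events * 100

# 999 successful HTTP calls out of 1,000 total -> 99.9% availability
print(round(sli_percent(999, 1000), 2))
```

The same function works for any of the ratios on the slide: swap in operations completing in under 10 ms, full-quality responses, or records processed.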
21. SLIS ARE A RATIO/PROPORTION & HOW
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10 ms/# of operations
…as measured at the client
…as measured at the load balancer
@RENEV
23. BASIC SLO RECIPE
THE THING
HTTP requests
Storage checks
Operations
SLI PROPORTIONS
“Successful 50% of the time”
“Can read the data 99.9% of the time”
“Return in 10ms 90% of the time”
TIME STATEMENT
“In the last ten minute period”
“During the last quarter”
“In the previous rolling 30 day period”
90% of HTTP requests
as reported by the
load balancer
succeeded in the last
30 day window.
SERVICE LEVEL OBJECTIVE (SLO)
@RENEV
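Putting the three ingredients together — the thing, the SLI proportion, and the time statement — an SLO check can be sketched like this (Python; names and numbers are illustrative):

```python
def slo_met(good: int, total: int, objective_pct: float) -> bool:
    """Does the measured SLI meet the objective over the window?"""
    if total == 0:
        return True  # no events in the window, nothing violated
    return good / total * 100 >= objective_pct

# "90% of HTTP requests, as reported by the load balancer,
#  succeeded in the last 30-day window."
print(slo_met(good=9_500, total=10_000, objective_pct=90.0))  # True
```

The counters `good` and `total` would come from whatever point of measurement you declared (here, the load balancer), aggregated over the stated window.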
40. WHAT DOES AN SRE DO
TIME ALLOCATION OF SRE
50%
Operational Work
50%
Project Work
• Incidents
• Tickets
• Operational work
• Project Work in Product
Teams
• Add Service Features
• Reduce future toil
@RENEV
41. TOIL
• Manual
• Repetitive
• Automatable
• Tactical
• Devoid of enduring value
• Scales linearly as a service grows
@RENEV
51. ACTIONABLE ALERTS
• Alerts are not: logs, notifications, heartbeats, normal
• Needs a human to investigate (and ideally resolve)
• Right human(s) (scope)
• Humans not automation
• Crucial details:
• Where the alert is coming from
• What expectation was violated
• Why this is an issue (for the customer)
• Steps to resolve the problem (or at least a specific pointer)
@RENEV
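One way to enforce the "crucial details" above is to make them required fields of the alert payload, so an alert simply cannot be raised without them. A hypothetical sketch (the field names and example values are assumptions, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    source: str           # where the alert is coming from
    violation: str        # what expectation was violated
    customer_impact: str  # why this is an issue for the customer
    runbook: str          # steps to resolve, or at least a specific pointer

alert = Alert(
    source="load balancer, region westeurope",
    violation="HTTP success ratio < 99.9% over the last 5 minutes",
    customer_impact="checkout requests are failing for end users",
    runbook="runbooks/http-5xx.md",  # hypothetical pointer
)
print(alert.violation)
```

Because every field is mandatory, a responder always gets the where, what, why, and how-to-fix in one place.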
59. BLAMELESS LEARNING REVIEW
INVOLVE EVERYONE
DO IT ASAP
FOCUS ON WHAT HAPPENED
WHAT DID THEY KNOW
WHEN DID THEY KNOW
HOW DID IT MAKE SENSE
CREATE A TIMELINE
WHAT WAS THE THOUGHT PROCESS?
@RENEV
65. The “Paul” attack
Response training (game days)
On-call rotations
Escalation paths
Communication channels
Chaos engineering/testing in production
BE READY
@RENEV
67. WRAP UP
CONSIDER YOUR SLA CAREFULLY
SRE IS NOT A NEW OPS DEPARTMENT
USE SLOs AND ERROR BUDGETS
MEASURE EVERYTHING
CLOSE THE INCIDENT LOOP WITH LEARNING
@RENEV
## Real problem
Operations does not know the code base and has the strongest incentive to block a release. That is a problem.
There is no focus on business value.
Photo by chuttersnap on Unsplash
* Ultimately, the definition that is widely used, especially in the Microsoft ecosystem, is Donovan Brown's.
* DevOps is the union of people, process, and products to enable continuous delivery of value to our end users.
* Important to mention: most sessions, including mine, cover the products and a bit of the process, but this session is really about the people part. Because, ultimately, that is what it is all about.
Photo by Derek Thomson on Unsplash
Make people responsible
4-types of money
Now that we've covered some of the most common aspects of reliability we should begin measuring, let's talk about how we represent those measurements for human consumption. This also moves us one step closer to establishing thresholds for alerting on user-impacting problems.
Service Level Indicators (SLIs) are a ratio, typically multiplied by 100 to express a proportion (out of 100%). "99.99% reliable" is an example of how SLIs are represented.
You have to be VERY clear about WHERE the data is coming from.
You have to make sure everyone is explicit about the data.
"As we measured it at the load balancer" ... that is specific.
Is the measurement from the server or the client?
We’re going to start building an SLO
In our example, we are going to use HTTP requests. But you decide what you measure (whatever is important to the customer).
We have to be specific and clear about where the data is coming from (Example: As reported by the load balancer).
This allows everyone to have a shared language across stakeholders in the company around reliability that everyone can agree on and move towards.
Important! We are trying to have clarity around these things so that when we need to take action, we know what to do.
It’s really important that you include a time period, so everyone is clear on the expectations.
https://microsoft.github.io/AzureBot/
Photo by Alex Litvin on Unsplash
Photo by Tim Mossholder on Unsplash
You can see that what we monitor can vary depending on what we care about.
The main thing to remember is that everything we examine MUST BE FROM THE PERSPECTIVE OF THE CUSTOMER.
Was it available TO THE CUSTOMER?
Measuring the availability of a component in isolation isn't very useful or helpful in understanding reliability.
- Availability: Can my system answer a question it is asked?
This indicator is generally used in one of two ways. For batch processing it could be the proportion of jobs that processed above some target amount of data. For stream processing it could be the proportion of records that were successfully processed within a window.
- Correctness: Generally used when looking at pipelines as a measurement of some kind of processing on data. We measure for the proportion of records going into the pipeline that result in a valid and correct result coming out
- Fidelity sometimes referred to as quality: Graceful failures. Lowering the fidelity but it's still reasonable. How often did I serve the entire page as I expected versus how often did I have to serve a simple plain site?
If bandwidth-limited, it may be acceptable to modify the image we provide for a request, so we intentionally have a policy to send images at a lower fidelity. The measurement here would be the proportion of requests that were served in an undegraded state over the total number of requests served.
- Freshness: Is the data being served current? Is the cache refreshed/purged as needed? How often am I serving stale data?
- Durability: If you wrote your data to Azure Storage or a page blob, do you have long-term data protection? That is, the stored data does not suffer from bit rot, degradation, or other corruption, and survives an outage.
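A freshness indicator like the one described above can also be expressed as a proportion: the share of responses served from data younger than some threshold. A minimal sketch (Python; the 5-minute threshold and timestamps are illustrative):

```python
from datetime import datetime, timedelta, timezone

def freshness_sli(served_at, max_age, now):
    """Proportion (%) of responses served from data no older than max_age."""
    if not served_at:
        return 100.0  # nothing served, nothing stale
    fresh = sum(1 for ts in served_at if now - ts <= max_age)
    return fresh / len(served_at) * 100

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Data ages behind four responses: 1, 2, 30, and 4 minutes old.
stamps = [now - timedelta(minutes=m) for m in (1, 2, 30, 4)]
print(freshness_sli(stamps, max_age=timedelta(minutes=5), now=now))  # 75.0
```

The same shape (good events over total events) applies to correctness, fidelity, and durability indicators; only the definition of a "good" event changes.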
Google Video
Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.
We share this 50% goal because toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time. The work of reducing toil and scaling up services is the "Engineering" in Site Reliability Engineering. Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more efficiently than either a pure Dev team or a pure Ops team.
Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.
Google SRE book definitions
Scaling linearly as a service grows: if you have something you have to do for 10 users, do you have to do it 10x as much for 100 users, or 100x for 1,000 users?
Photo by Gary Chan on Unsplash
Self-service.. ChatOps.. Slack.. Can you reboot the server..
Photo by Tim Mossholder on Unsplash
Many complex components; an observable system with rich data, where adjustments match expectations.
No matter how our systems are architected, at the end of the day their sole purpose is to provide value to an end user. Expectations exist for the "customer". As such, we must know at any given time whether the applications, infrastructure, etc. are doing what we expect them to do, and whether those expectations align with the customer's.
Engineers, teams, organizations, and the business as a whole have goals they set out to achieve. With each change made to our system it is important to understand whether those changes have a positive, negative, or neutral impact on any explicitly stated goals. Perhaps a KPI for one aspect of a system is the ability to handle a 10x increase in traffic should such a spike occur. What monitoring do we have in place that can confirm whether recent improvements (i.e., changes) actually meet that goal?
In fact, it is the constant change of our systems that requires us to know whether or not those changes are doing what we expected. It could be a new product or merely a hotfix to existing services, but we must be able to measure the change being introduced to the system and determine whether it meets our expectations.
Show PU site
Show Live Metrics
Show Log Analytics
Show Application Map
Show End to End Traces
Show Azure Monitor
Alert fatigue
Titles – easier scanning
Alerts have a specific purpose and are not logs. They should require a human to do something
Netflix: the "pulse of Netflix"
“What was the thought process when the engineer took that action?”
After we conduct a post-incident review, we should announce the availability of the meeting notes and any associated artifacts (e.g. timelines, chat logs, status page information).
This information should be placed in a centralized location where the entire organization can access it and learn from the incident.
That way we can translate local learnings and improvements into global learnings and improvements.
## Rene - V ##
* So this is me, your personal guide for today
* I work as a DevOps consultant
* Normally with dev teams and such, but one of my latest assignments was at a bank
* Security was something I never had a particular interest in, just like many of us
* Boring stuff, not nice. And very, very restrictive
* Until the moment I needed to create "Secure Pipelines"
Mentimeter poll: "What is your gender?"
As Eliyahu Goldratt mentions in his book The Goal, there is ultimately one goal for every (commercial) organisation, and that is to make money. And nowadays, making money means being better than your competitor, or worse, your future competitor.
* Cloud enables really small organisations to compete with really large ones.
* With only a credit card, a good idea, and some skillful people you can make a difference
Photo by Pepi Stojanovski on Unsplash