The more we are connected and the more others are connected to us, the more important the reliability of our sites becomes. Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services, and products. But what does this mean, and how do you get started? In this session I will talk about the concepts of Site Reliability Engineering and use Microsoft Azure to implement some of these concepts and practices.
9. WHAT WE HEAR FROM OPS
“We need to have a launch review!”
“Please let the CAB approve first”
“This is our change management checklist”
“We should validate first in Test and Pre-Prod”
@RENEVO
10. WHAT WE HEAR FROM DEV
“We do not launch big changes, this is just a flag flip”
“This is a hotfix!”
“This is just a UI change... No big thing”
“Only a 20% experiment.”
15. SITE RELIABILITY ENGINEERING IS AN
ENGINEERING DISCIPLINE DEVOTED TO
HELPING AN ORGANIZATION SUSTAINABLY
ACHIEVE THE APPROPRIATE LEVEL OF
RELIABILITY IN THEIR SYSTEMS, SERVICES,
AND PRODUCTS.
22. SLIS ARE A RATIO/PROPORTION
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10ms/# of operations
# of “full quality responses”/# of responses
# of records processed/# of records
Ratio * 100 = % proportion
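The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `sli_percent` and the request counts are assumptions made up for the example.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return an SLI as a percentage: good events / all events * 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive to form a ratio")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    # Multiply before dividing so simple cases stay exact in floating point.
    return 100 * good_events / total_events


# Hypothetical counts, for illustration only.
successful_http_calls = 9_950
http_calls = 10_000
print(f"{sli_percent(successful_http_calls, http_calls):.1f}%")  # prints "99.5%"
```

The same function covers the other examples on the slide: pass the number of operations that completed in under 10 ms, or the number of records processed, as `good_events`.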
23. SLIS ARE A RATIO/PROPORTION & HOW
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10 ms/# of operations
…as measured at the client
…as measured at the load balancer
25. BASIC SLO RECIPE
THE THING
HTTP requests
Storage checks
Operations
SLI PROPORTIONS
“Successful 50% of the time”
“Can read the data 99.9% of the time”
“Return in 10ms 90% of the time”
TIME STATEMENT
“In the last ten minute period”
“During last quarter”
“In the previous rolling 30 day period”
SERVICE LEVEL OBJECTIVE (SLO)
90% of HTTP requests as reported by the load balancer succeeded in the last 30 day window.
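The recipe (the thing, where it is measured, the SLI proportion, and the time statement) can be captured as a small data structure. The sketch below is one possible shape, assuming a hypothetical `Slo` class and made-up request counts; it is not tied to any Azure API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for the four parts of the SLO recipe."""
    thing: str             # what is measured, e.g. "HTTP requests"
    measured_at: str       # where the SLI is measured, e.g. "load balancer"
    target_percent: float  # the SLI proportion that must be reached
    window: str            # the time statement, e.g. "30 day window"

    def is_met(self, good: int, total: int) -> bool:
        """Compare the measured proportion for the window with the target."""
        if total <= 0:
            raise ValueError("total must be positive")
        return 100 * good / total >= self.target_percent

    def describe(self) -> str:
        """Render the SLO as a sentence in the style used on the slide."""
        return (f"{self.target_percent:g}% of {self.thing} as reported by the "
                f"{self.measured_at} succeeded in the last {self.window}.")


slo = Slo("HTTP requests", "load balancer", 90.0, "30 day window")
print(slo.describe())
# Hypothetical counts: 9,200 of 10,000 requests succeeded in the window.
print(slo.is_met(good=9_200, total=10_000))
```

Keeping the measurement point as an explicit field makes it visible that the same SLI measured at the client and at the load balancer can give different answers.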
34. WHAT DOES AN SRE DO?
TIME ALLOCATION OF SRE
50% Operational Work
• Incidents
• Tickets
• Operational work
50% Project Work
• Project Work in Product Teams
• Add Service Features
• Reduce future toil
35. TOIL
• Manual
• Repetitive
• Automatable
• Tactical
• Devoid of enduring value
• Scales linearly as a service grows
45. ACTIONABLE ALERTS
• Alerts are not: logs, notifications, heartbeats, normal
• Needs a human to investigate (and ideally resolve)
• Right human(s) (scope)
• Humans not automation
• Crucial details:
• Where the alert is coming from
• What expectation was violated
• Why this is an issue (for the customer)
• Steps to resolve the problem (or at least a specific pointer)
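The crucial details listed above can be treated as required fields of an alert. The sketch below assumes a hypothetical `Alert` class and `missing_details` helper; the field names and the sample values are illustrative and do not come from any monitoring product.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert payload carrying the four crucial details."""
    source: str                # where the alert is coming from
    violated_expectation: str  # what expectation was violated
    customer_impact: str       # why this is an issue for the customer
    resolution_steps: str      # steps to resolve, or a specific pointer


def missing_details(alert: Alert) -> list[str]:
    """Return the names of crucial details that were left blank."""
    return [f.name for f in fields(alert) if not getattr(alert, f.name).strip()]


# Illustrative values only.
alert = Alert(
    source="load balancer",
    violated_expectation="90% of HTTP requests succeed",
    customer_impact="",
    resolution_steps="See the runbook entry for failing backends",
)
print(missing_details(alert))  # prints ['customer_impact']
```

A check like this could run before an alert is allowed to page a human, so that an alert without a stated customer impact or a pointer to resolution steps is sent back to its author instead.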
53. BLAMELESS LEARNING REVIEW
INVOLVE EVERYONE
DO IT ASAP
FOCUS ON WHAT HAPPENED
WHAT DID THEY KNOW
WHEN DID THEY KNOW
HOW DID IT MAKE SENSE
CREATE A TIMELINE
WHAT WAS THE THOUGHT PROCESS?
58. The “Paul” attack
Response training (game days)
On-call rotations
Escalation paths
Communication channels
Chaos engineering/testing in production
BE READY
59. WRAP UP
CONSIDER YOUR SLA CAREFULLY
SRE IS NOT A NEW OPS DEPARTMENT
USE SLOS AND ERROR BUDGETS
MEASURE EVERYTHING
CLOSE THE INCIDENT LOOP WITH LEARNING
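As a worked example of an error budget: the budget is the share of events that may fail while the SLO is still met, that is, 100% minus the SLO target. The sketch below uses the 90% target from the earlier SLO example; the function names and the request and failure counts are assumptions for illustration.

```python
from decimal import Decimal


def allowed_failures(target_percent: str, total_events: int) -> int:
    """Number of events that may fail in the window without missing the SLO.

    The target is passed as a string (e.g. "99.9") so that Decimal can
    represent it exactly and avoid binary floating-point rounding.
    """
    budget_fraction = (Decimal(100) - Decimal(target_percent)) / Decimal(100)
    return int(total_events * budget_fraction)  # int() truncates toward zero


def remaining_budget(target_percent: str, total_events: int, failed: int) -> int:
    """Failures still available; a negative value means the budget is spent."""
    return allowed_failures(target_percent, total_events) - failed


# Hypothetical window: 10,000 requests, 400 of which failed so far.
print(allowed_failures("90", 10_000))       # prints 1000
print(remaining_budget("90", 10_000, 400))  # prints 600
```

A team can use the remaining budget as a shared signal: while budget is left, changes can ship; once it is spent, reliability work takes priority.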