When down is not good enough. SRE On Azure - PolarConf

WHEN "WE ARE DOWN" IS NOT GOOD ENOUGH
SITE RELIABILITY ENGINEERING (SRE) IN AZURE
RENÉ VAN OSNABRUGGE
@RENEVO

LET’S COUNT
TOGETHER

AVAILABILITY IN HUMAN WORDS
Availability    per year        per quarter     per month       per week        per day         per hour
90%             36.5 days       9 days          3 days          16.8 hours      2.4 hours       6 minutes
95%             18.25 days      4.5 days        1.5 days        8.4 hours       1.2 hours       3 minutes
99%             3.65 days       21.6 hours      7.2 hours       1.68 hours      14.4 minutes    36 seconds
99.5%           1.83 days       10.8 hours      3.6 hours       50.4 minutes    7.20 minutes    18 seconds
99.9%           8.76 hours      2.16 hours      43.2 minutes    10.1 minutes    1.44 minutes    3.6 seconds
99.95%          4.38 hours      1.08 hours      21.6 minutes    5.04 minutes    43.2 seconds    1.8 seconds
99.99%          52.6 minutes    12.96 minutes   4.32 minutes    60.5 seconds    8.64 seconds    0.36 seconds
99.999%         5.26 minutes    1.30 minutes    25.9 seconds    6.05 seconds    0.87 seconds    0.04 seconds
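The downtime figures for each availability level follow from simple arithmetic: allowed downtime = (1 - availability) x period length. A minimal Python sketch of that calculation, assuming a 365-day year; the function name `allowed_downtime` is illustrative and does not come from the talk:

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime a given availability target permits within a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return period * unavailable_fraction


# Example: "three nines" over an assumed 365-day year.
print(allowed_downtime(99.9, timedelta(days=365)))
```

Running the same function with `timedelta(hours=1)` or `timedelta(days=1)` gives the other columns.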

WHAT DO OTHERS HAVE?
Azure DevOps: 99.9%
Azure Policy: ----
Gmail: 99.9% (excluding planned downtime)
Azure AAD: 99.9%
Outlook.com: ----
Azure Functions: 99.95%
Google Maps API: 99.9%
ING Banking: ----
Slack: 99.99%

WHAT WE HEAR FROM OPS
“We need to have a launch review!”
“Please let the CAB approve first”
“This is our change management checklist”
“We should validate first in Test and Pre-Prod”

WHAT WE HEAR FROM DEV
“We do not launch big changes, this is just a flag flip”
“This is a hotfix!”
“This is just a UI change... No big thing”
“Only a 20% experiment.”

CONFLICTING KPIS
DO NOT WORK

2 PARALLEL MOVEMENTS
EMERGED

“DEVOPS IS THE UNION OF PEOPLE,
PROCESS AND PRODUCTS TO ENABLE
CONTINUOUS DELIVERY OF VALUE TO
OUR END-USERS”
DONOVAN BROWN

SITE RELIABILITY ENGINEERING IS AN
ENGINEERING DISCIPLINE DEVOTED TO
HELPING AN ORGANIZATION SUSTAINABLY
ACHIEVE THE APPROPRIATE LEVEL OF
RELIABILITY IN THEIR SYSTEMS, SERVICES,
AND PRODUCTS.

public class SRE: DevOps {
…
}

DEVOPS                          SRE
REDUCE ORGANIZATIONAL SILOS     SHARE OWNERSHIP
ACCEPT FAILURE AS NORMAL        SLOS & BLAMELESS LEARNING REVIEWS
IMPLEMENT GRADUAL CHANGE        REDUCE COST OF FAILURE
LEVERAGE TOOLING & AUTOMATION   AUTOMATION AS 1ST CLASS CITIZEN
MEASURE EVERYTHING              MEASURE TOIL AND RELIABILITY

SRE focuses on the reliability of your site
Product team focuses on the business value
They share ownership
They are all engineers

THE PRICE OF GREATNESS IS RESPONSIBILITY
WINSTON CHURCHILL

S L I
SERVICE LEVEL INDICATOR

SLIS ARE A RATIO/PROPORTION
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10ms/# of operations
# of “full quality responses”/# of responses
# of records processed/# of records
Ratio * 100 = % proportion
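Each SLI on this slide is a count of good events divided by a count of all events. A minimal Python sketch of that ratio, using made-up counts and latencies; the helper name `sli_percent` is an assumption for illustration:

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Express an SLI as the percentage of good events among all events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events / total_events * 100


# Availability SLI: successful HTTP calls / all HTTP calls (hypothetical counts).
availability_sli = sli_percent(good_events=9_985, total_events=10_000)

# Latency SLI: operations that completed in < 10 ms / all operations.
durations_ms = [3.2, 8.9, 12.4, 7.7, 9.9]  # hypothetical measurements
fast_operations = sum(1 for d in durations_ms if d < 10)
latency_sli = sli_percent(fast_operations, len(durations_ms))

print(availability_sli, latency_sli)
```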

SLIS ARE A RATIO/PROPORTION & HOW
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10 ms/# of operations
…as measured at the client
…as measured at the load balancer

S L O
SERVICE LEVEL OBJECTIVE

BASIC SLO RECIPE
THE THING
• HTTP requests
• Storage checks
• Operations
SLI PROPORTIONS
• “Successful 50% of the time”
• “Can read the data 99.9% of the time”
• “Return in 10 ms 90% of the time”
TIME STATEMENT
• “In the last ten minute period”
• “During the last quarter”
• “In the previous rolling 30 day period”
SERVICE LEVEL OBJECTIVE (SLO)
90% of HTTP requests as reported by the load balancer succeeded in the last 30 day window.
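The recipe combines a thing, a proportion, and a time statement. A minimal Python sketch of checking such an SLO, assuming the events are available as (timestamp, succeeded) pairs, for example exported from load balancer logs; the function name `slo_met` and the choice to treat an empty window as compliant are assumptions:

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def slo_met(events: Iterable[Tuple[datetime, bool]], target_pct: float,
            window: timedelta, now: datetime) -> bool:
    """Check whether the share of successful events inside the window meets the target."""
    cutoff = now - window
    in_window = [ok for ts, ok in events if cutoff <= ts <= now]
    if not in_window:
        return True  # assumption: no traffic in the window means no violation
    # Compare counts instead of floats: good / total >= target / 100.
    return sum(in_window) * 100 >= target_pct * len(in_window)


# "90% of HTTP requests succeeded in the last 30 day window" (hypothetical data).
now = datetime(2020, 1, 31)
events = [(now - timedelta(days=d), d % 20 != 0) for d in range(1, 30)]
print(slo_met(events, target_pct=90, window=timedelta(days=30), now=now))
```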

AGREEMENT BETWEEN THE PRODUCT TEAM AND SRE

SLA
UPTIME
COMMERCIAL AGREEMENT
WITH CUSTOMER
SLO
SLA

ERROR BUDGETS
UPTIME
ERROR BUDGET
SLO
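The error budget is the gap between 100% uptime and the SLO. A minimal Python sketch of how much budget is left in a window, assuming downtime is tracked as a duration; the name `remaining_error_budget` is illustrative:

```python
from datetime import timedelta


def remaining_error_budget(slo_pct: float, window: timedelta,
                           downtime: timedelta) -> timedelta:
    """Return the unspent error budget; a negative result means the SLO is breached."""
    budget = window * ((100 - slo_pct) / 100)
    return budget - downtime


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
left = remaining_error_budget(99.9, timedelta(days=30), timedelta(minutes=10))
print(left)
```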

ERROR BUDGETS
UPTIME
NO ERROR BUDGET
NO RELEASE
SLO
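The rule on this slide can be expressed as a simple gate in a release pipeline. A hedged Python sketch, assuming the pipeline can look up the downtime accumulated in the current window; `release_allowed` is a hypothetical helper, not an Azure API:

```python
from datetime import timedelta


def release_allowed(slo_pct: float, window: timedelta, downtime: timedelta) -> bool:
    """Allow a release only while some error budget is left in the window."""
    budget = window * ((100 - slo_pct) / 100)
    return downtime < budget


window = timedelta(days=30)
print(release_allowed(99.9, window, timedelta(minutes=10)))  # budget left
print(release_allowed(99.9, window, timedelta(minutes=50)))  # budget spent
```

A real gate would probably also allow reliability fixes through while feature releases are frozen; that policy choice is left out of the sketch.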

DEMO – SLI AND SLO IN LOG ANALYTICS

WHAT DOES DOWN ACTUALLY MEAN?

RELIABILITY
AVAILABILITY
LATENCY
THROUGHPUT
CORRECTNESS
FRESHNESS
COVERAGE
FIDELITY
DURABILITY
IT MUST BE ABOUT THE CUSTOMER, NOT THE
SOFTWARE COMPONENT

WHAT DOES AN SRE DO?
TIME ALLOCATION OF SRE
50% Operational Work
• Incidents
• Tickets
• Operational work
50% Project Work
• Project Work in Product Teams
• Add Service Features
• Reduce future toil
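The 50/50 split can be checked against tracked hours. A minimal Python sketch, assuming the team logs hours per category; the category names mirror the slide, while `operational_share` and the sample hours are made up for illustration:

```python
from typing import Mapping

# Categories the slide counts as operational work.
OPERATIONAL = {"incidents", "tickets", "operational work"}


def operational_share(hours: Mapping[str, float]) -> float:
    """Return the fraction of logged hours spent on operational work."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for category, h in hours.items() if category in OPERATIONAL)
    return ops / total


week = {"incidents": 10, "tickets": 6, "operational work": 8, "reduce future toil": 16}
share = operational_share(week)
print(f"{share:.0%} operational", "(over the 50% cap)" if share > 0.5 else "")
```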

TOIL
• Manual
• Repetitive
• Automatable
• Tactical
• Devoid of enduring value
• Scales linearly as a service grows

DEMO – USING AZURE BOTS TO REDUCE TOIL

WHY MONITOR?
Are my apps and infrastructure doing what I expect?
Are my apps and infrastructure doing what others expect?

3 KINDS OF MONITORING OUTPUT
ALERTS
TICKETS
LOGGING

WHO RESPONDS?
SHARED RESPONSIBILITY OF SRE

ALERT FATIGUE
A serious problem

ACTIONABLE ALERTS
• Alerts are not: logs, notifications, heartbeats, or normal conditions
• Needs a human to investigate (and ideally resolve)
• Right human(s) (scope)
• Humans not automation
• Crucial details:
  • Where the alert is coming from
  • What expectation was violated
  • Why this is an issue (for the customer)
  • Steps to resolve the problem (or at least a specific pointer)
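The crucial details listed above can be enforced as required fields on an alert before it pages anyone. A minimal Python sketch; the `Alert` class, its field names, and the sample values are assumptions for illustration, not an Azure Monitor schema:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    source: str                # where the alert is coming from
    violated_expectation: str  # what expectation was violated
    customer_impact: str       # why this is an issue for the customer
    resolution_steps: str      # steps to resolve, or at least a specific pointer
    owner: str                 # the right human(s) to respond


def missing_details(alert: Alert) -> List[str]:
    """Return the names of crucial fields that were left empty."""
    return [name for name, value in vars(alert).items() if not value.strip()]


alert = Alert(
    source="load balancer, web frontend",
    violated_expectation="fewer than 99.9% of requests succeeded in the last 10 minutes",
    customer_impact="checkout fails for some users",
    resolution_steps="",  # no pointer yet, so this alert is not actionable
    owner="on-call SRE",
)
print(missing_details(alert))
```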

DEMO – ACTIONABLE ALERTS

MTTR
On average, how long does it take to restore service when a
service incident occurs?
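MTTR is an average over incident durations. A minimal Python sketch, assuming each incident is recorded as a (started, restored) pair of timestamps; the function name and the sample incidents are illustrative:

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def mean_time_to_restore(incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average the time between incident start and service restoration."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


incidents = [
    (datetime(2020, 1, 3, 9, 0), datetime(2020, 1, 3, 9, 30)),    # 30 minutes
    (datetime(2020, 1, 17, 22, 0), datetime(2020, 1, 17, 23, 30)),  # 90 minutes
]
print(mean_time_to_restore(incidents))
```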

CONTINUOUS DELIVERY AS
A MEANS TO AN END

BLAMELESS LEARNING REVIEW
INVOLVE EVERYONE
DO IT ASAP
FOCUS ON WHAT HAPPENED
WHAT DID THEY KNOW
WHEN DID THEY KNOW
HOW DID IT MAKE SENSE
CREATE A TIMELINE
WHAT WAS THE THOUGHT PROCESS?

OUTCOMES
COUNTERMEASURES
LONG TERM
SHORT TERM
INTERNAL SUMMARY
EXTERNAL SUMMARY

DEMO – TROUBLESHOOTING GUIDES

MISTAKES ARE MEANT FOR LEARNING
NOT FOR REPEATING

BE READY
The “Paul” attack
Response training (game days)
On-call rotations
Escalation paths
Communication channels
Chaos engineering/testing in production

WRAP UP
CONSIDER YOUR SLA CAREFULLY
SRE IS NOT A NEW OPS DEPARTMENT
USE SLOS AND ERROR BUDGETS
MEASURE EVERYTHING
CLOSE THE INCIDENT LOOP WITH LEARNING

René van Osnabrugge
Xpirit Netherlands
@renevo
rvanosnabrugge@xpirit.com
https://roadtoalm.com
Attributions
Pictures: https://unsplash.com / https://www.flickr.com/photos/wocintechchat
Gifs: https://giphy.com
Music: https://open.spotify.com/user/rvanosnabrugge/playlist/0BWgsNPM5iwgk8ZGlMHeoY?si=l9-tV8FTR8S1J7AbKBz-KA
Video: https://www.youtube.com/watch?v=SGAnLY46zAk
Thanks: Martijn, Xpirit
