The more we are connected, and the more others connect to us, the more important the reliability of our sites becomes.
Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services, and products. But what does this mean, and how do you get started with it?
In this session I will talk about the concepts of Site Reliability Engineering and use Microsoft Azure to implement some of the concepts and practices.
You will learn:
What is Site Reliability Engineering?
How can you get started with SRE?
How to use Azure to implement some of the SRE concepts?
7. WHAT WE HEAR FROM
OPS
“We need to have a launch review!”
“Please let the CAB approve first”
“This is our change management checklist”
“We should validate first in Test and Pre-Prod”
@RENEV
8. WHAT WE HEAR FROM
DEV
“We do not launch big changes, this is just a
flag flip”
“This is a hotfix!”
“This is just a UI change... No big thing”
“Only a 20% experiment.”
@RENEV
13. SITE RELIABILITY ENGINEERING IS AN
ENGINEERING DISCIPLINE DEVOTED
TO HELPING AN ORGANIZATION
SUSTAINABLY ACHIEVE THE
APPROPRIATE LEVEL OF RELIABILITY
IN THEIR SYSTEMS, SERVICES, AND
PRODUCTS.
@RENEV
20. SLIS ARE A RATIO/PROPORTION
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10ms/# of operations
# of “full quality responses”/# of responses
# of records processed/# of records
Ratio * 100 = % proportion
@RENEV
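The ratios above translate directly into code. A minimal sketch in Python (the function name and the example numbers are illustrative, not from the talk):

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """SLI as a proportion: (good events / total events) * 100."""
    if total_events == 0:
        # No events in the window: conventionally treated as 100% reliable.
        return 100.0
    return good_events / total_events * 100

# 999 successful HTTP calls out of 1,000 total -> 99.9% availability
print(round(sli_percent(999, 1000), 2))
```

The same function works for any of the ratios on the slide: swap in operations completing in under 10 ms, full-quality responses, or records processed.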
21. SLIS ARE A RATIO/PROPORTION & HOW
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10 ms/# of operations
…as measured at the client
…as measured at the load balancer
@RENEV
23. BASIC SLO RECIPE
THE THING
HTTP requests
Storage checks
Operations
SLI PROPORTIONS
“Successful 50% of the time”
“Can read the data 99.9% of the time”
“Return in 10ms 90% of the time”
TIME STATEMENT
“In the last ten minute period”
“During the last quarter”
“In the previous rolling 30 day period”
90% of HTTP requests
as reported by the
load balancer
succeeded in the last
30 day window.
SERVICE LEVEL OBJECTIVE (SLO)
@RENEV
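Putting the three ingredients together — the thing, the SLI proportion, and the time statement — an SLO check can be sketched like this (Python; names and numbers are illustrative):

```python
def slo_met(good: int, total: int, objective_pct: float) -> bool:
    """Does the measured SLI meet the objective over the window?"""
    if total == 0:
        return True  # no events in the window, nothing violated
    return good / total * 100 >= objective_pct

# "90% of HTTP requests, as reported by the load balancer,
#  succeeded in the last 30-day window."
print(slo_met(good=9_500, total=10_000, objective_pct=90.0))  # True
```

The counters `good` and `total` would come from whatever point of measurement you declared (here, the load balancer), aggregated over the stated window.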
40. WHAT DOES AN SRE DO
TIME ALLOCATION OF SRE
50%
Operational Work
50%
Project Work
• Incidents
• Tickets
• Operational work
• Project Work in Product
Teams
• Add Service Features
• Reduce future toil
@RENEV
41. TOIL
• Manual
• Repetitive
• Automatable
• Tactical
• Devoid of enduring value
• Scales linearly as a service grows
@RENEV
51. ACTIONABLE ALERTS
• Alerts are not: logs, notifications, heartbeats, normal
• Needs a human to investigate (and ideally resolve)
• Right human(s) (scope)
• Humans not automation
• Crucial details:
• Where the alert is coming from
• What expectation was violated
• Why this is an issue (for the customer)
• Steps to resolve the problem (or at least a specific pointer)
@RENEV
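One way to enforce the "crucial details" above is to make them required fields of the alert payload, so an alert simply cannot be raised without them. A hypothetical sketch (the field names and example values are assumptions, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    source: str           # where the alert is coming from
    violation: str        # what expectation was violated
    customer_impact: str  # why this is an issue for the customer
    runbook: str          # steps to resolve, or at least a specific pointer

alert = Alert(
    source="load balancer, region westeurope",
    violation="HTTP success ratio < 99.9% over the last 5 minutes",
    customer_impact="checkout requests are failing for end users",
    runbook="runbooks/http-5xx.md",  # hypothetical pointer
)
print(alert.violation)
```

Because every field is mandatory, a responder always gets the where, what, why, and how-to-fix in one place.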
59. BLAMELESS LEARNING REVIEW
INVOLVE EVERYONE
DO IT ASAP
FOCUS ON WHAT HAPPENED
WHAT DID THEY KNOW
WHEN DID THEY KNOW
HOW DID IT MAKE SENSE
CREATE A TIMELINE
WHAT WAS THE THOUGHT PROCESS?
@RENEV
65. The “Paul” attack
Response training (game days)
On-call rotations
Escalation paths
Communication channels
Chaos engineering/testing in production
BE READY
@RENEV
67. WRAP UP
CONSIDER YOUR SLA CAREFULLY
SRE IS NOT A NEW OPS DEPARTMENT
USE SLOs AND ERROR BUDGETS
MEASURE EVERYTHING
CLOSE THE INCIDENT LOOP WITH LEARNING
@RENEV
## Real problem
Operations does not know the code base and has the strongest incentive to block a release. That is a problem.
There is no focus on business value.
Photo by chuttersnap on Unsplash
* Ultimately, the definition that is widely used, especially in the Microsoft ecosystem, is Donovan Brown's.
* DevOps is the union of people, process, and products to enable continuous delivery of value to our end users.
* Important to mention: most sessions, including mine, cover the products and a bit of the process, but this session is really about the people part. Because, ultimately, that is what it is all about.
Photo by Derek Thomson on Unsplash
Make people responsible
4-types of money
Now that we've covered some of the most common aspects of reliability we should begin measuring, let's talk about how we represent those measurements for human consumption. This also moves us one step closer to establishing thresholds for alerting on user-impacting problems.
Service Level Indicators (SLIs) are a ratio, typically multiplied by 100 to express a proportion (out of 100%). "99.99% reliable" is an example of how SLIs are represented.
You have to be VERY clear about WHERE the data is coming from.
You have to make sure everyone is explicit about the data.
"As we measured it at the load balancer" ... that is specific.
Is the measurement from the server or the client?
We’re going to start building an SLO
In our example, we are going to use HTTP requests. But you decide what you measure (whatever is important to the customer).
We have to be specific and clear about where the data is coming from (Example: As reported by the load balancer).
This allows everyone to have a shared language across stakeholders in the company around reliability that everyone can agree on and move towards.
Important! We are trying to have clarity around these things so that when we need to take action, we know what to do.
It’s really important that you include a time period, so everyone is clear on the expectations.
https://microsoft.github.io/AzureBot/
Photo by Alex Litvin on Unsplash
Photo by Tim Mossholder on Unsplash
You can see that what we monitor can vary depending on what we care about.
The main thing to remember is that everything we examine MUST BE FROM THE PERSPECTIVE OF THE CUSTOMER.
Was it available TO THE CUSTOMER?
Measuring the availability of a component in isolation isn't very useful or helpful in understanding reliability.
- Availability: Can my system answer a question it is asked?
This indicator is generally used in one of two ways. For batch processing it could be the proportion of jobs that processed above some target amount of data. For stream processing it could be the proportion of records that were successfully processed within a window.
- Correctness: Generally used when looking at pipelines as a measurement of some kind of processing on data. We measure for the proportion of records going into the pipeline that result in a valid and correct result coming out
- Fidelity sometimes referred to as quality: Graceful failures. Lowering the fidelity but it's still reasonable. How often did I serve the entire page as I expected versus how often did I have to serve a simple plain site?
If bandwidth-limited, it may be acceptable to modify the image we provide for a request, so we intentionally have a policy to send images at a lower fidelity. The measurement here would be the proportion of requests that were served in an undegraded state over the total number of requests served.
- Freshness: Is the data being served current? Is the cache refreshed/purged as needed? How often am I serving stale data?
- Durability: If you wrote your data to Azure Storage or a page blob, do you have long-term data protection? That is, the stored data does not suffer from bit rot, degradation, or other corruption, and survives an outage.
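A freshness indicator like the one described above can also be expressed as a proportion: the share of responses served from data younger than some threshold. A minimal sketch (Python; the 5-minute threshold and timestamps are illustrative):

```python
from datetime import datetime, timedelta, timezone

def freshness_sli(served_at, max_age, now):
    """Proportion (%) of responses served from data no older than max_age."""
    if not served_at:
        return 100.0  # nothing served, nothing stale
    fresh = sum(1 for ts in served_at if now - ts <= max_age)
    return fresh / len(served_at) * 100

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Data ages behind four responses: 1, 2, 30, and 4 minutes old.
stamps = [now - timedelta(minutes=m) for m in (1, 2, 30, 4)]
print(freshness_sli(stamps, max_age=timedelta(minutes=5), now=now))  # 75.0
```

The same shape (good events over total events) applies to correctness, fidelity, and durability indicators; only the definition of a "good" event changes.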
Google Video
Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.
We share this 50% goal because toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time. The work of reducing toil and scaling up services is the "Engineering" in Site Reliability Engineering. Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more efficiently than either a pure Dev team or a pure Ops team.
Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.
Google SRE book definitions
Scaling linearly as a service grows: if you have something you have to do for 10 users, do you have to do it 10x as much for 100 users, or 100x for 1,000 users?
Photo by Gary Chan on Unsplash
Self-service.. ChatOps.. Slack.. Can you reboot the server..
Photo by Tim Mossholder on Unsplash
Many complex components; an observable system with rich data, where adjustments match expectations.
No matter how our systems are architected, at the end of the day their sole purpose is to provide value to an end user. Expectations exist for the "customer". As such, we must know at any given time whether the applications, infrastructure, etc. are doing what we expect them to do, and whether those expectations align with the customer's.
Engineers, teams, organizations, and the business as a whole have goals they set out to achieve. With each change made to our system it is important to understand whether those changes have a positive, negative, or neutral impact on any explicitly stated goals. Perhaps a KPI for one aspect of a system is the ability to handle a 10x increase in traffic should such a spike occur. What monitoring do we have in place that can confirm whether recent improvements (i.e., changes) actually meet that goal?
In fact, it is the constant change of our systems that requires us to know whether or not those changes are doing what we expected. It could be a new product or merely a hotfix to existing services, but we must be able to measure the change being introduced to the system and determine whether it meets our expectations.
Show PU site
Show Live Metrics
Show Log Analytics
Show Application Map
Show End to End Traces
Show Azure Monitor
Alert fatigue
Titles – easier scanning
Alerts have a specific purpose and are not logs. They should require a human to do something
Netflix: the "pulse of Netflix"
“What was the thought process when the engineer took that action?”
After we conduct a post-incident review, we should announce the availability of the meeting notes and any associated artifacts (e.g. timelines, chat logs, status page information).
This information should be placed in a centralized location where the entire organization can access it and learn from the incident.
That way we can translate local learnings and improvements into global learnings and improvements.
## Rene - V ##
* So this is me, your personal guide for today
* I work as a DevOps consultant
* Normally with dev teams and such, but one of my latest assignments was at a bank
* Security was something I never had a particular interest in, just like many of us
* Boring stuff, not nice. And very, very restrictive
* Until the moment I needed to create "Secure Pipelines"
Mentimeter poll: "What is your gender?"
As Eliyahu Goldratt mentions in his book The Goal, there is ultimately one goal for every (commercial) organisation, and that is to make money. And nowadays, making money means being better than your competitor, or worse, your future competitor.
* Cloud enables really small organisations to compete with really large ones.
* With only a credit card, a good idea, and some skillful people you can make a difference
Photo by Pepi Stojanovski on Unsplash