VST15
WHEN "WE ARE DOWN" IS NOT GOOD ENOUGH
SRE ON AZURE
RENÉ VAN OSNABRUGGE
@RENEVO
LET’S COUNT TOGETHER
99.9%
AVAILABILITY IN HUMAN WORDS
Availability  per year      per quarter    per month      per week       per day        per hour
90%           36.5 days     9 days         3 days         16.8 hours     2.4 hours      6 minutes
95%           18.25 days    4.5 days       1.5 days       8.4 hours      1.2 hours      3 minutes
99%           3.65 days     21.6 hours     7.2 hours      1.68 hours     14.4 minutes   36 seconds
99.5%         1.83 days     10.8 hours     3.6 hours      50.4 minutes   7.20 minutes   18 seconds
99.9%         8.76 hours    2.16 hours     43.2 minutes   10.1 minutes   1.44 minutes   3.6 seconds
99.95%        4.38 hours    1.08 hours     21.6 minutes   5.04 minutes   43.2 seconds   1.8 seconds
99.99%        52.6 minutes  12.96 minutes  4.32 minutes   60.5 seconds   8.64 seconds   0.36 seconds
99.999%       5.26 minutes  1.30 minutes   25.9 seconds   6.05 seconds   0.87 seconds   0.04 seconds
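The figures in the availability table follow from one multiplication: period length times the unavailable fraction. A minimal Python sketch, assuming a 365-day year, a 90-day quarter and a 30-day month (these lengths reproduce the table values); the names `PERIODS` and `allowed_downtime` are illustrative and not from the talk:

```python
from datetime import timedelta

# Assumed period lengths; they reproduce the values in the table.
PERIODS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=90),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
}


def allowed_downtime(availability_pct: float, period: str) -> timedelta:
    """Return the downtime that an availability target permits in one period."""
    unavailable_fraction = 1 - availability_pct / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    # Print the 99.9% row of the table.
    for name in PERIODS:
        print(name, allowed_downtime(99.9, name))
```

For 99.9% this gives roughly 8.76 hours per year and 43.2 minutes per month, matching the row that the "LET'S COUNT TOGETHER" slide asks the audience to work out.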
WHAT DO OTHERS HAVE?
Azure DevOps     99.9%
Azure Policy     ----
Gmail            99.9% (excl. planned downtime)
Azure AAD        99.9%
Outlook.com      ----
Azure Functions  99.95%
Google Maps API  99.9%
ING Banking      ----
Slack            99.99%
OPS VERSUS DEV
WHAT WE HEAR FROM OPS
“We need to have a launch review!”
“Please let the CAB approve first”
“This is our change management checklist”
“We should validate first in Test and Pre-Prod”
WHAT WE HEAR FROM DEV
“We do not launch big changes, this is just a flag flip”
“This is a hotfix!”
“This is just a UI change... No big thing”
“Only a 20% experiment.”
CONFLICTING KPIS DO NOT WORK
2 PARALLEL MOVEMENTS EMERGED
“DEVOPS IS THE UNION OF PEOPLE, PROCESS AND PRODUCTS TO ENABLE CONTINUOUS DELIVERY OF VALUE TO OUR END-USERS”
DONOVAN BROWN
SITE RELIABILITY ENGINEERING IS AN ENGINEERING DISCIPLINE DEVOTED TO HELPING AN ORGANIZATION SUSTAINABLY ACHIEVE THE APPROPRIATE LEVEL OF RELIABILITY IN THEIR SYSTEMS, SERVICES, AND PRODUCTS.
public class SRE: DevOps {
…
}
DEVOPS                          SRE
REDUCE ORGANIZATIONAL SILOS     SHARE OWNERSHIP
ACCEPT FAILURE AS NORMAL        SLOS & BLAMELESS LEARNING REVIEWS
IMPLEMENT GRADUAL CHANGE        REDUCE COST OF FAILURE
LEVERAGE TOOLING & AUTOMATION   AUTOMATION AS 1st CLASS CITIZEN
SRE focusses on the reliability of your site
Product team focusses on the business value
They share ownership
They are all engineers
THE PRICE OF GREATNESS IS RESPONSIBILITY
WINSTON CHURCHILL
SL*: SLI, SLO, SLA
SLI: SERVICE LEVEL INDICATOR
SLIS ARE A RATIO/PROPORTION
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10ms/# of operations
# of “full quality responses”/# of responses
# of records processed/# of records
Ratio * 100 = % proportion
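The ratio can be sketched in a few lines of Python. The function names and the 10 ms default are illustrative assumptions taken from the examples on this slide, not a prescribed implementation:

```python
from typing import Sequence


def sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / all events * 100."""
    if total_events == 0:
        raise ValueError("an SLI is undefined without any events")
    return good_events / total_events * 100


def latency_sli(durations_ms: Sequence[float], threshold_ms: float = 10.0) -> float:
    """Percentage of operations that completed in less than threshold_ms."""
    fast = sum(1 for duration in durations_ms if duration < threshold_ms)
    return sli(fast, len(durations_ms))


if __name__ == "__main__":
    print(sli(good_events=999, total_events=1000))
    print(latency_sli([1.0, 5.0, 12.0, 9.0]))
```

Both calls use the same shape, good events over all events, which is what makes the different SLIs comparable.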
SLIS ARE A RATIO/PROPORTION & HOW
# of successful HTTP calls/# of HTTP calls
# of operations that completed in < 10 ms/# of operations
…as measured at the client
…as measured at the load balancer
SLO: SERVICE LEVEL OBJECTIVE
BASIC SLO RECIPE
THE THING
• HTTP requests
• Storage checks
• Operations
SLI PROPORTIONS
• “Successful 50% of the time”
• “Can read the data 99.9% of the time”
• “Return in 10ms 90% of the time”
TIME STATEMENT
• “In the last ten minute period”
• “During last quarter”
• “In the previous rolling 30 day period”
SERVICE LEVEL OBJECTIVE (SLO)
90% of HTTP requests, as reported by the load balancer, succeeded in the last 30 day window.
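The recipe (a thing, a proportion, a time statement) maps onto a small data structure. The following Python sketch is one possible reading; the class names `Request` and `Slo` are hypothetical, and treating an empty window as "objective met" is an assumption made here:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """The thing: one HTTP request as reported by the load balancer."""
    timestamp: datetime
    succeeded: bool


@dataclass(frozen=True)
class Slo:
    target_pct: float   # the SLI proportion, e.g. 90.0
    window: timedelta   # the time statement, e.g. the last 30 days

    def is_met(self, requests: Sequence[Request], now: datetime) -> bool:
        window_start = now - self.window
        in_window = [r for r in requests if window_start <= r.timestamp <= now]
        if not in_window:
            return True  # assumption: no traffic counts as meeting the objective
        good = sum(1 for r in in_window if r.succeeded)
        # Compare without dividing to avoid floating-point edge cases.
        return good * 100 >= self.target_pct * len(in_window)


if __name__ == "__main__":
    now = datetime(2019, 11, 20)
    traffic = [Request(now - timedelta(days=1), True)] * 9
    traffic.append(Request(now - timedelta(days=2), False))
    print(Slo(target_pct=90.0, window=timedelta(days=30)).is_met(traffic, now))
```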
AGREEMENT BETWEEN THE PRODUCT TEAM AND SRE
SLA
COMMERCIAL AGREEMENT WITH CUSTOMER
(Figure: UPTIME scale with the SLO and the SLA marked)
ERROR BUDGETS
(Figure: UPTIME scale with the SLO and the ERROR BUDGET marked)
NO ERROR BUDGET, NO RELEASE
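An error budget is commonly computed as the share of events that the SLO allows to fail. A hedged Python sketch of that arithmetic and of the "no error budget, no release" rule; the function names are illustrative and the event-based budget is an assumption (a time-based budget works the same way):

```python
def error_budget(slo_pct: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates in the measured window."""
    return total_events * (100 - slo_pct) / 100


def remaining_budget(slo_pct: float, total_events: int, failed_events: int) -> float:
    """Budget left after subtracting the failures already observed."""
    return error_budget(slo_pct, total_events) - failed_events


def may_release(slo_pct: float, total_events: int, failed_events: int) -> bool:
    """No error budget, no release."""
    return remaining_budget(slo_pct, total_events, failed_events) > 0


if __name__ == "__main__":
    # A 99.9% SLO over one million requests tolerates about 1000 failures.
    print(error_budget(99.9, 1_000_000))
    print(may_release(99.9, 1_000_000, failed_events=400))
    print(may_release(99.9, 1_000_000, failed_events=1500))
```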
DEMO – SLI AND SLO IN LOG ANALYTICS
WHAT DOES DOWN ACTUALLY MEAN?
RELIABILITY
• AVAILABILITY
• LATENCY
• THROUGHPUT
• CORRECTNESS
• FRESHNESS
• COVERAGE
• FIDELITY
• DURABILITY
IT MUST BE ABOUT THE CUSTOMER, NOT THE SOFTWARE COMPONENT
AVAILABILITY
Can my system answer a question it is asked?
LATENCY
Is the system (or service) returning an answer in the amount of time it needs to?
SLOW IS THE NEW …
THROUGHPUT
Can you get the volume of data from one point to another?
COVERAGE
Has all data been processed?
CORRECTNESS
The proportion of records going into the pipeline that result in a valid and correct result coming out.
FIDELITY
How many requests are served in an undegraded way?
FRESHNESS
Is the data being served current? Is the cache refreshed or purged as needed? How often am I serving stale data?
DURABILITY
Does the stored data stay free of bit rot, degradation, or other corruption after an outage?
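Several of these dimensions can be measured from the same response records. The Python sketch below is only an illustration: the `Response` fields and the thresholds (`max_ms`, `max_age_s`) are assumptions chosen for the example, and a real service would pick its own:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence


@dataclass(frozen=True)
class Response:
    answered: bool      # availability: did the system answer at all?
    duration_ms: float  # latency: how long did the answer take?
    degraded: bool      # fidelity: was the answer served in a degraded way?
    data_age_s: float   # freshness: how old was the data that was served?


def proportion(flags: Iterable[bool]) -> float:
    """Percentage of True values; every dimension reuses this ratio."""
    values = list(flags)
    if not values:
        raise ValueError("no responses to measure")
    return 100 * sum(values) / len(values)


def reliability_report(
    responses: Sequence[Response], max_ms: float = 200.0, max_age_s: float = 60.0
) -> Dict[str, float]:
    return {
        "availability": proportion(r.answered for r in responses),
        "latency": proportion(r.answered and r.duration_ms <= max_ms for r in responses),
        "fidelity": proportion(r.answered and not r.degraded for r in responses),
        "freshness": proportion(r.answered and r.data_age_s <= max_age_s for r in responses),
    }


if __name__ == "__main__":
    sample = [
        Response(True, 120.0, False, 10.0),
        Response(True, 450.0, False, 10.0),
        Response(True, 80.0, True, 300.0),
        Response(False, 0.0, False, 0.0),
    ]
    print(reliability_report(sample))
```

In this sample the service is "up" for three of four requests, yet only half of them are fast, undegraded or fresh, which is the gap between "we are down" and what the customer experiences.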
WHAT DOES AN SRE DO
TIME ALLOCATION OF SRE
50% Operational Work
• Incidents
• Tickets
• Operational work
50% Project Work
• Project Work in Product Teams
• Add Service Features
• Reduce future toil
TOIL
• Manual
• Repetitive
• Automatable
• Tactical
• Devoid of enduring value
• Scales linearly as a service grows
DEMO – USING AZURE BOTS TO REDUCE TOIL
OPERATIONAL WORK
DETECT
WHY MONITOR?
Are my apps and infrastructure doing what I expect?
Are my apps and infrastructure doing what others expect?
3 KINDS OF MONITORING OUTPUT
ALERTS
TICKETS
LOGGING
RESPOND
WHO RESPONDS?
SHARED RESPONSIBILITY OF SRE
ALERT FATIGUE
ACTIONABLE ALERTS
• Alerts are not: logs, notifications, heartbeats, normal
• Needs a human to investigate (and ideally resolve)
• Right human(s) (scope)
• Humans not automation
• Crucial details:
  • Where the alert is coming from
  • What expectation was violated
  • Why this is an issue (for the customer)
  • Steps to resolve the problem (or at least a specific pointer)
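The crucial details listed above can be enforced as required fields on an alert. A minimal Python sketch; the `Alert` class, its field names and the sample values are hypothetical and not tied to any Azure alerting API:

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Alert:
    source: str                # where the alert is coming from
    violated_expectation: str  # what expectation was violated
    customer_impact: str       # why this is an issue for the customer
    resolution_steps: str      # steps to resolve, or at least a specific pointer


def missing_details(alert: Alert) -> List[str]:
    """Return the names of the crucial details that were left empty."""
    return [f.name for f in fields(alert) if not getattr(alert, f.name).strip()]


def is_actionable(alert: Alert) -> bool:
    """An alert is treated as actionable only when every detail is filled in."""
    return not missing_details(alert)


if __name__ == "__main__":
    alert = Alert(
        source="load balancer",
        violated_expectation="90% of HTTP requests succeed",
        customer_impact="customers cannot complete checkout",
        resolution_steps="",
    )
    print(is_actionable(alert), missing_details(alert))
```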
IT IS ABOUT THE USER!
DEMO – ACTIONABLE ALERTS
REMEDIATE
MTTR
On average, how long does it take to restore service when a service incident occurs?
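As a worked example of that average, the following Python sketch computes MTTR from incident start and restore times. The function name and the tuple layout are assumptions for illustration:

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

Incident = Tuple[datetime, datetime]  # (incident started, service restored)


def mean_time_to_restore(incidents: Sequence[Incident]) -> timedelta:
    """Average duration between the start of an incident and restored service."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    incidents = [
        (datetime(2019, 11, 1, 9, 0), datetime(2019, 11, 1, 9, 30)),
        (datetime(2019, 11, 8, 14, 0), datetime(2019, 11, 8, 15, 30)),
    ]
    print(mean_time_to_restore(incidents))
```

With one 30-minute and one 90-minute incident, the result is one hour.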
CONTINUOUS DELIVERY IS A MEANS TO AN END
ANALYZE
BLAMELESS LEARNING REVIEW
INVOLVE EVERYONE
DO IT ASAP
FOCUS ON WHAT HAPPENED
WHAT DID THEY KNOW?
WHEN DID THEY KNOW?
HOW DID IT MAKE SENSE?
CREATE A TIMELINE
WHAT WAS THE THOUGHT PROCESS?
OUTCOMES
COUNTERMEASURES
LONG TERM
SHORT TERM
INTERNAL SUMMARY
EXTERNAL SUMMARY
EXTERNAL SUMMARY
Summary
Services impacted
Duration
Severity
Customer impact
Resolution
Countermeasures and improvements
DEMO – TROUBLESHOOTING GUIDES
READINESS
MISTAKES ARE MEANT FOR LEARNING, NOT FOR REPEATING
The “Paul” attack
Response training (game days)
On-call rotations
Escalation paths
Communication channels
Chaos engineering/testing in
BE READY
WRAP UP
CONSIDER YOUR SLA CAREFULLY
SRE IS NOT A NEW OPS DEPARTMENT
USE SLOS AND ERROR BUDGETS
MEASURE EVERYTHING
CLOSE THE INCIDENT LOOP WITH LEARNING
René van Osnabrugge
Xpirit Netherlands
@renevo
rvanosnabrugge@xpirit.com
https://roadtoalm.com
Attributions
Pictures: https://unsplash.com / https://www.flickr.com/photos/wocintechchat
Gifs: https://giphy.com
Music: https://open.spotify.com/user/rvanosnabrugge/playlist/0BWgsNPM5iwgk8ZGlMHeoY?si=l9-tV8FTR8S1J7AbKBz-KA
Video: https://www.youtube.com/watch?v=SGAnLY46zAk
Thanks: Martijn, Xpirit

Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

VSLive Orlando 2019 - When "We are down" is not good enough. SRE on Azure

Editor's notes

  1. 3.6 seconds
  2. Photo by Jay Heike on Unsplash
  3. Photo by Andrew Shiau on Unsplash
  4. Photo by Glen Carrie on Unsplash
  5. Photo by Glen Carrie on Unsplash
  6. Real problem: Operations does not know the code base and has the strongest incentive to block a release. That is a problem. There is no focus on business value. Photo by chuttersnap on Unsplash
  7. Ultimately, the definition that is most widely used, especially in our Microsoft ecosystem, is Donovan Brown's: DevOps is the union of people, process, and products to enable continuous delivery of value to our end users. Most sessions, including mine, cover the products and a bit of the process, but this session is mainly about the people, because ultimately that is what it is all about.
  8. Photo by Derek Thomson on Unsplash. Make people responsible. 4 types of money.
  9. Now that we've covered some of the most common aspects of reliability that we should begin measuring, let's talk about how we represent those measurements for human consumption, which also moves us one step closer to establishing thresholds for alerting on user-impacting problems. A Service Level Indicator (SLI) is a ratio, typically multiplied by 100 so that it is expressed as a percentage. "99.99% reliable" is an example of how an SLI is represented.
  10. You have to be VERY clear about WHERE the data is coming from. You have to make sure everyone is explicit about the data. "As we measured it at the load balancer" ... that is specific. Is the measurement from the server or the client?
  11. We’re going to start building an SLO. In our example, we are going to use HTTP requests, but you decide what you are measuring (whatever is important to the customer). We have to be specific and clear about where the data is coming from (example: as reported by the load balancer). This gives all stakeholders in the company a shared language around reliability that everyone can agree on and move towards. Important! We are trying to have clarity around these things so that when we need to take action, we know what to do. It’s really important that you have a time period, so everyone is clear on the expectations.
  12. https://microsoft.github.io/AzureBot/ Photo by Alex Litvin on Unsplash
  13. Photo by Tim Mossholder on Unsplash
  14. You can see that what we monitor can vary depending on what we care about. The main thing to remember is that everything we examine MUST BE FROM THE PERSPECTIVE OF THE CUSTOMER. Was it available TO THE CUSTOMER? Measuring the availability of a component isn't that useful or helpful in understanding reliability.
  15. - Availability: Can my system answer a question it is asked?
  16. - Availability: Can my system answer a question it is asked?
  17. - Availability: Can my system answer a question it is asked?
  18. This indicator is generally used in one of two ways. For batch processing, it could be the proportion of jobs that processed above some target amount of data. For stream processing, it could be the proportion of records that were successfully processed within a window.
  19. - Correctness: Generally used when looking at pipelines, as a measurement of some kind of processing on data. We measure the proportion of records going into the pipeline that result in a valid and correct result coming out.
  20. - Fidelity, sometimes referred to as quality: Graceful failures. Lowering the fidelity while the result is still reasonable. How often did I serve the entire page as I expected versus how often did I have to serve a simple plain site? If bandwidth is limited, it may be acceptable to modify the image we provide for a request, so we intentionally have a policy to send images at a lower fidelity. The measurement here would be the proportion of requests that were served in an undegraded state over the total number of requests served.
  21. - Freshness: Is the data being served current? Is the cache refreshed or purged as needed? How often am I serving stale data?
  22. - Durability: If you wrote your data to Azure storage or a page blob, do you have long-term data protection, i.e. the stored data does not suffer from bit rot, degradation, or other corruption after an outage?
  23. Google Video Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect. We share this 50% goal because toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time. The work of reducing toil and scaling up services is the "Engineering" in Site Reliability Engineering. Engineering work is what enables the SRE organization to scale up sublinearly with service size and to manage services more efficiently than either a pure Dev team or a pure Ops team. Furthermore, when we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. We need to keep that promise by not allowing the SRE organization or any subteam within it to devolve into an Ops team.
  24. Google SRE book definitions Scaling linearly as service grows: If you have something you have to do for 10 users, do you have to do it 10x as much for 100 users or 100x for 1000 users? Photo by Gary Chan on Unsplash
  25. https://microsoft.github.io/AzureBot/ Photo by Alex Litvin on Unsplash Self-service.. ChatOps.. Slack.. Can you reboot the server..
  26. Photo by Tim Mossholder on Unsplash
  27. Many complex components. Observable system with rich data. Adjustments match expectations. No matter how our systems are architected, at the end of the day their sole purpose is to provide value to an end user. Expectations exist for the "customer". As such, we must know at any given time whether the applications, infrastructure, etc. are doing what we expect them to do, and whether those expectations align with the customer's. Engineers, teams, organizations, and the business as a whole have goals they set out to achieve. With each change made to our system it is important to understand whether that change has a positive, negative, or neutral impact on any explicitly stated goals. Perhaps a KPI for one aspect of a system is to be able to handle a 10x increase in traffic should such a spike occur. What monitoring do we have in place that can confirm whether recent improvements (i.e. changes) actually achieve that goal? In fact, it is the constant change of our systems that requires us to know whether those changes are doing what we expected them to. It could be a new product or merely a hotfix to existing services, but we must be able to measure the change that is being introduced to the system and determine whether it is meeting our expectations.
  28. https://microsoft.github.io/AzureBot/ Photo by Alex Litvin on Unsplash Show PU site Show Live Metrics Show Log Analytics Show Application Map Show End to End Traces Show Azure Monitor
  29. Photo by Tim Mossholder on Unsplash
  30. Alert fatigue. Titles: easier scanning. Alerts have a specific purpose and are not logs. They should require a human to do something.
  31. Netflix pulse of netflix
  32. https://microsoft.github.io/AzureBot/ Photo by Alex Litvin on Unsplash Self-service.. ChatOps.. Slack.. Can you reboot the server..
  33. Photo by Tim Mossholder on Unsplash
  34. Photo by Tim Mossholder on Unsplash
  35. Photo by Tim Mossholder on Unsplash
  36. “What was the thought process when the engineer took that action?”
  37. After we conduct a post-incident review, we should announce the availability of the meeting notes and any associated artifacts (e.g. timelines, chat logs, status page information). This information should be placed in a centralized location where the entire organization can access it and learn from the incident. That way we can translate local learnings and improvements into global learnings and improvements.
  38. After we conduct a post-incident review, we should announce the availability of the meeting notes and any associated artifacts (e.g. timelines, chat logs, status page information). This information should be placed in a centralized location where the entire organization can access it and learn from the incident. That way we can translate local learnings and improvements into global learnings and improvements.
  39. https://microsoft.github.io/AzureBot/ Photo by Alex Litvin on Unsplash Self-service.. ChatOps.. Slack.. Can you reboot the server..
  40. Photo by Tim Mossholder on Unsplash
  41. Marcel
  42. ## Rene - V ## * So this is me. Your personal gate for today * I work as a DevOps consultant * Normally with Dev Teams and such. But one of my latest assignments was at a bank * Security was something I never had a particular interest in. Just like many of us * Boring stuff, not nice. And very very restrictive * Until that moment when I needed to create "Secure Pipelines"
  43. What is your gender mentimeter
  44. As Eliyahu Goldratt mentions in his book The Goal, there is ultimately one goal of every (commercial) organisation, and that is to make money. Nowadays, making money means being better than your competitor or, even worse, your future competitor. * Cloud enables really small organisations to compete with really large ones. * With only a credit card, a good idea, and some skillful people you can make a difference. Photo by Pepi Stojanovski on Unsplash
  45. https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started
  46. Based on Azure Data Explorer. Underlying all the Azure Monitor capabilities. Event logs, syslogs, custom logs, performance, app insights, and more.
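
The SLI and SLO notes above (a ratio of good events to total events, measured at a stated source over a stated time period) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's tooling: the function names, the request counts, and the 99.9% target are all hypothetical and chosen for the example.

```python
"""Minimal SLI/SLO sketch. All names and numbers are illustrative assumptions."""


def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return 100.0 * good_events / total_events


def meets_slo(sli: float, slo_target_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target_percent


def allowed_downtime_seconds(slo_target_percent: float, window_seconds: float) -> float:
    """Downtime that the SLO tolerates within one time window."""
    return window_seconds * (1.0 - slo_target_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical counts of HTTP requests, as reported by the load balancer
    # over the agreed time period.
    successful, total = 999_412, 1_000_000
    sli = sli_percent(successful, total)
    print(f"SLI: {sli:.3f}%  meets 99.9% SLO: {meets_slo(sli, 99.9)}")
    # A 99.9% target leaves roughly 3.6 seconds of downtime per hour.
    print(f"Budget per hour: {allowed_downtime_seconds(99.9, 3600):.1f} s")
```

Keeping the measurement source (here, the load balancer) and the time window explicit in the inputs is what makes the resulting number something all stakeholders can agree on.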