Damon Edwards
Incident Management
in the Age of DevOps and SRE
November 12, 2019
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/incident-management-devops-sre/
Presented at QCon San Francisco
www.qconsf.com
@damonedwards
Assertion:
The ability to respond to and resolve incidents is the true
indicator of an organization’s operational capabilities
Assertion 2:
Everybody now works in “Operations”
What Is an Incident?
An unplanned disruption impacting
customers or business operations
Outages
Service Degradation
Work interruption
Delay/Waiting
“Short-Notice” Requests
Integrated
Responsive
Everywhere
Always
Board
Tech Org Execution
Kubernetes, AWS, GCP, Azure, Docker, Consul, Terraform, Istio,
Zipkin, Envoy, Serverless, OpenShift, Kafka, Lambda, Prometheus,
Containerd, Helm, Cloud Foundry, Linkerd, Etcd, CoreDNS, MongoDB,
Redis, InfluxDB, Jaeger, gRPC, CRI-O, Cognito, Fargate,
Cloud Functions, Cosmos, BigQuery, Spark, Rook, Ceph, NGINX,
HAProxy, Open vSwitch, NSX, Sensu, Vault, Aurora, Nomad
SAIL/cornell.edu
Adrian Cockcroft
[Diagram labels: Developer; Release Plan; Deploy Feature to Production; Bugs; Old Release Still Running]
Immutable microservice deployment scales, is faster with large teams and
diverse platform components
(DockerCon EU 2014)
Architecture enables speed.
Speed is the advantage.
The Three Ways (2013) The Five Ideals (2019)
DEV: Go! Go! Go!
…OPS?
Operations:
The Last Mile
Principles of SRE
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
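One common way to make "Service Level Objectives, with consequences" concrete is an error budget: the share of requests the SLO allows to fail in a window. The sketch below is illustrative only. The `ErrorBudget` class, the 99.9% target, the request counts, and the "pause releases when the budget is spent" consequence are all assumptions for the example, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for one SLO window (all numbers are illustrative)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_requests

    def release_allowed(self) -> bool:
        # Hypothetical consequence: pause feature releases once the budget is spent.
        return self.remaining > 0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.allowed_failures), round(budget.remaining), budget.release_allowed())
```

With these assumed numbers the team has used 400 of roughly 1,000 allowed failures, so releases continue; once failures exceed the budget, `release_allowed()` turns false and the agreed consequence applies.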
DevOps + SRE
Product, Not Project + Continuous Delivery + Shift Left + Error Budgets + Toil Limits + Cloud Native
Dev Ops
Cross-Functional Team
“Value-Aligned” and Self-Regulating
Shared
Responsibility
Model
Traditional ITSM
ITIL (1989 - ?)
Unintentionally Encourages Silos
Encourages command & control management
Old Way
New Way
+
REDeploy.io
Why? Why? Why? Why? Why?
There is no root cause.
(That’s just a political distinction)
Right, Wrong, Safety II, and You.
Incidents = unplanned investments
REDeploy.io
You
Not
18 Million IT Ops
22.3 Million Developers
Col. John Boyd
OODA Loop
Monitoring
Spotting the knowns
Observability
Interrogating the unknowns
Logging: The event
Metrics: Data points over time
Tracing: Events in context of a single request
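The three signals above can be shown side by side in a few lines. This is a minimal sketch using only the Python standard library; the service name `checkout`, the event names, and the in-memory `metrics` store are hypothetical stand-ins, and a real system would more likely send these signals to dedicated logging, metrics, and tracing backends.

```python
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")  # hypothetical service name

# Metrics: data points over time, stored here as (timestamp, value) pairs.
metrics = defaultdict(list)


def record_metric(name: str, value: float) -> None:
    metrics[name].append((time.time(), value))


def handle_request(order_id: str) -> str:
    # Tracing: one ID ties together every event of a single request.
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    # Logging: the event itself, tagged with the trace ID.
    log.info("trace=%s event=order_received order=%s", trace_id, order_id)
    log.info("trace=%s event=payment_authorized order=%s", trace_id, order_id)
    record_metric("request_latency_seconds", time.perf_counter() - start)
    record_metric("requests_total", 1)
    return trace_id


trace = handle_request("order-42")
```

Filtering the log lines by `trace` reconstructs what happened within that one request, while the `metrics` series shows how latency and volume move across many requests.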
Automated Governance
Objective automated attestation of
GRC controls
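As a rough illustration of automated attestation of GRC (governance, risk, and compliance) controls, the sketch below evaluates a change against a set of control checks and emits a record with a content digest so later tampering is detectable. The control names, the evidence fields, and the `attest` helper are hypothetical; the slides do not prescribe a format or tool.

```python
import hashlib
import json
from typing import Callable, Dict


def attest(evidence: dict, controls: Dict[str, Callable[[dict], bool]]) -> dict:
    """Evaluate each control against the evidence and return an attestation record."""
    results = {name: bool(check(evidence)) for name, check in controls.items()}
    record = {
        "evidence": evidence,
        "results": results,
        "compliant": all(results.values()),
    }
    # Digest of the record contents, computed before the digest field is added.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["digest"] = hashlib.sha256(payload).hexdigest()
    return record


# Hypothetical controls: each returns True when the evidence satisfies it.
controls = {
    "peer_reviewed": lambda e: e["approvals"] >= 1,
    "tests_passed": lambda e: e["tests_passed"],
}

record = attest({"change": "deploy-123", "approvals": 2, "tests_passed": True}, controls)
```

Because the checks run as code on every change, the result is objective and repeatable rather than a manual sign-off.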
Monitoring
Observability
Governance
Everyone
Incident Command
Mobilization, Coordination, Communication
Incident Command System
(FEMA)
GitHub: PagerDuty/incident-response-docs
Ops = Platform Eng + SRE
Divide and conquer
SRE: Expert Operators (distributed)
Platform Eng: Build and Operate Platform Services (centralized)
New Views on Escalations
Avoid… but swarm if you do
Support at the edge
Swarm
Take Action!
Diagnose: Health checks, exploratory actions
Restore: Restart, repair actions, rollback
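The diagnose-then-restore split can be sketched as two small functions: one runs health checks and reports what is failing, the other applies a registered repair action for each failure. Everything here is an assumption for illustration: the `state` dictionary stands in for real services, and the single restore action represents something like a restart.

```python
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def diagnose(checks: List[Check]) -> List[str]:
    """Run every health check and return the names of the ones that fail."""
    return [name for name, check in checks if not check()]


def restore(failed: List[str], actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Apply the restore action registered for each failed check, if any."""
    performed = []
    for name in failed:
        action = actions.get(name)
        if action is not None:
            action()
            performed.append(name)
    return performed


# Hypothetical service state standing in for real systems.
state = {"web": "down", "db": "up"}

checks = [
    ("web", lambda: state["web"] == "up"),
    ("db", lambda: state["db"] == "up"),
]
actions = {"web": lambda: state.update(web="up")}  # e.g. a restart

failed = diagnose(checks)          # which checks fail before acting
fixed = restore(failed, actions)   # which failures had a restore action applied
still_failing = diagnose(checks)   # re-run the checks to confirm recovery
```

Re-running the checks after the restore step is the important part of the design: the action is only considered successful when the diagnosis comes back clean.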
The Return of Runbooks
A while ago: Runbooks (Mostly Manual)
Not that long ago: …
Now: Runbooks (Automate!…How?)
Thanks SRE!
Runbook Automation
Safe self-service access to the expert knowledge
you need to take action.
Moving the bits is the easy part!
Empower those closest to the action!
De-risk!
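A minimal sketch of what "safe self-service" could look like in code: the expert's procedure is captured once as a runbook, access is limited by role, and every attempt is written to an audit log. The `Runbook` and `RunbookService` classes, the role names, and the `restart-web` runbook are hypothetical and do not represent the API of any particular runbook automation product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Runbook:
    name: str
    allowed_roles: Set[str]     # who may run it without escalating
    steps: Callable[[], str]    # the expert's procedure, captured as code


@dataclass
class RunbookService:
    runbooks: Dict[str, Runbook] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def register(self, runbook: Runbook) -> None:
        self.runbooks[runbook.name] = runbook

    def run(self, name: str, user: str, role: str) -> str:
        runbook = self.runbooks[name]
        if role not in runbook.allowed_roles:
            # Denied attempts are recorded too, so access stays reviewable.
            self.audit_log.append(f"DENIED {user} ({role}) -> {name}")
            raise PermissionError(f"{role} may not run {name}")
        result = runbook.steps()
        self.audit_log.append(f"RAN {user} ({role}) -> {name}: {result}")
        return result


service = RunbookService()
service.register(Runbook("restart-web", {"developer", "sre"}, lambda: "web restarted"))
outcome = service.run("restart-web", user="alice", role="developer")
```

Here a developer close to the incident can run the vetted restart without escalating, while the role check and audit log provide the guardrails.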
Before Runbook Automation…
3 options:
1. Decipher the wiki
2. Ad-hoc tool/script usage
3. ESCALATE!
…with Runbook Automation
Shorter Incidents. Fewer Escalations.
Before RBA
With RBA
Solve Difficult Security & Compliance Problems
Before RBA
With RBA
Everything Through an SDLC
Promote
Runbooks as a Service
Incidents = unplanned investments …the ROI is up to you.
Recap!
Elevate the Human.
@damonedwards
damon@rundeck.com
Let’s talk…
Special thanks to