Managing incidents in a DevOps environment can feel like a near-insurmountable task. With shared responsibilities and on-call rotations, anyone might be called into a system firefight at any time. Accepting failure, and the problems that complex systems create, is a core tenet of DevOps thinking, and helping your team respond to incidents more effectively is key.
Matthew Boeckman has served on the frontlines of DevOps incident management for 19 years. He’s seen it all and is an expert on building teams and workflows to support effective alerting, clear communication, and rapid recovery.
What is VictorOps?
VictorOps ingests all of your alerts from your current monitoring tools and becomes the logical
layer between your alerts and the people who receive them.
Five Phases of Incident Management
Detection
● monitoring
● metrics
● thresholds
Response
● alerting
● on-call
● escalation
Remediation
● fixes
● tickets
● deployment
Analysis
● postmortem
● how or why
● understand
Readiness
● improve
● game days
● learning
#1 Blended Approach to Detection
● Synthetic Testing
● Time-series data
● Application Monitoring
● Log Analytics
Detection - Synthetic Testing
Synthetic monitoring leverages scripted web interactions to
validate critical-path user interactions (or system interactions).
[Diagram: example user flow covering Landing Page, Existing User?, Registration, Login, Welcome Stream, User Home, Interaction]
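A scripted check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step names, paths, expected text, and the `fake_fetch` stand-in for real HTTP requests are all hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str    # human-readable step name
    path: str    # hypothetical URL path to request
    expect: str  # text that must appear in the response body


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str


def run_synthetic_check(steps: List[Step], fetch: Callable[[str], str]) -> List[StepResult]:
    """Run the steps in order and stop at the first failure, as a real user would."""
    results = []
    for step in steps:
        try:
            body = fetch(step.path)
            ok = step.expect in body
            detail = "ok" if ok else f"missing {step.expect!r}"
        except Exception as exc:  # any fetch error counts as a failed step
            ok, detail = False, f"error: {exc}"
        results.append(StepResult(step.name, ok, detail))
        if not ok:
            break
    return results


# Fake site standing in for real HTTP calls (assumption, for illustration only).
FAKE_SITE = {
    "/": "Welcome! Sign up or log in",
    "/login": "Login form",
    "/home": "Your stream",
}


def fake_fetch(path: str) -> str:
    return FAKE_SITE[path]


STEPS = [
    Step("landing page", "/", "log in"),
    Step("login", "/login", "Login form"),
    Step("user home", "/home", "Your stream"),
]

if __name__ == "__main__":
    for result in run_synthetic_check(STEPS, fake_fetch):
        print(result.name, result.ok, result.detail)
```

In practice `fetch` would wrap a real HTTP client or browser driver, and a failed step would feed the alerting pipeline rather than a print statement.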
Detection - Time Series
Systems are not static.
Why treat measurements as if they were?
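One way to act on this idea is to compare each new measurement against a moving baseline instead of a fixed threshold. The sketch below is a minimal example, assuming a simple rolling mean and standard deviation; the class name, window size, and sigma multiplier are illustrative choices, not recommendations from the talk.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flag values that deviate sharply from the recent window of measurements."""

    def __init__(self, window: int = 30, sigmas: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)  # only the most recent `window` values
        self.sigmas = sigmas
        self.min_samples = min_samples

    def is_anomalous(self, value: float) -> bool:
        """Compare `value` with the current window, then add it to the window."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sd = pstdev(self.values)
            # Guard against a zero spread when the window is perfectly flat.
            anomalous = abs(value - mu) > self.sigmas * max(sd, 1e-9)
        self.values.append(value)
        return anomalous


if __name__ == "__main__":
    baseline = RollingBaseline()
    for latency_ms in [100, 102, 98, 101, 99, 100, 180]:
        print(latency_ms, baseline.is_anomalous(latency_ms))
```

Because the window keeps moving, a sustained shift in the metric eventually becomes the new normal, which is the behavior a static threshold cannot provide.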
Detection - Application Performance (APM)
APM monitors complex application behaviors
and reports or alerts on deviation from norms.
[Diagram: APM coverage across UX, Runtime, Tracing, and Internals: RUM, transactions, page performance, thread profiling, application internals, timings and counters, microservices, transactional path, 3rd party calls, dependency management, circuit breakers]
Detection - Log Analytics
Log Analysis opens new detection avenues with insight
particularly into security and compliance concerns.
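As a small illustration of a security-oriented detection built on logs, the following sketch counts failed logins per source address and flags sources that cross a threshold. The log line format (modeled on typical sshd messages), the threshold, and the sample addresses are assumptions for the example.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed log format, similar to common sshd "Failed password" messages.
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def suspicious_sources(lines: Iterable[str], threshold: int = 3) -> Dict[str, int]:
    """Return source addresses with at least `threshold` failed logins."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1  # group 2 is the source address
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Sample lines using documentation-reserved IP addresses.
SAMPLE = [
    "Jan 10 10:00:01 host sshd[1]: Failed password for root from 203.0.113.7 port 22 ssh2",
    "Jan 10 10:00:03 host sshd[1]: Failed password for invalid user admin from 203.0.113.7 port 22 ssh2",
    "Jan 10 10:00:05 host sshd[1]: Failed password for root from 203.0.113.7 port 22 ssh2",
    "Jan 10 10:01:00 host sshd[2]: Failed password for alice from 198.51.100.4 port 22 ssh2",
    "Jan 10 10:02:00 host sshd[3]: Accepted password for alice from 198.51.100.4 port 22 ssh2",
]

if __name__ == "__main__":
    print(suspicious_sources(SAMPLE))
```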
#2 Focus on Business Outcomes
Business Activity Monitoring maps key metrics
to data or flows within the IT environment.
These datapoints become detectable conditions
for actionable alerting.
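To make the idea of a business metric as a detectable condition concrete, here is a minimal sketch that watches a checkout conversion rate. The metric, the field names, and both thresholds are hypothetical examples chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusinessWindow:
    """Counts observed during one measurement window (hypothetical fields)."""
    checkouts_started: int
    orders_completed: int


def conversion_alert(window: BusinessWindow,
                     min_rate: float = 0.5,
                     min_volume: int = 20) -> Optional[str]:
    """Return an alert message when conversion drops below `min_rate`, else None."""
    if window.checkouts_started < min_volume:
        return None  # too little traffic to judge the rate
    rate = window.orders_completed / window.checkouts_started
    if rate < min_rate:
        return f"checkout conversion {rate:.0%} below {min_rate:.0%}"
    return None


if __name__ == "__main__":
    print(conversion_alert(BusinessWindow(checkouts_started=100, orders_completed=30)))
    print(conversion_alert(BusinessWindow(checkouts_started=100, orders_completed=80)))
```

The volume guard is a design choice: with very little traffic, a low rate is more likely noise than an incident.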
Alert Fatigue is a Leading Cause of Burnout
“I get paged for issues that I
can’t resolve; most of my time is
either researching a problem that
is transient or non reproducible,
or contacting a vendor.”
“I’m on call for everything from
infrastructure issues to, much
more often, application issues for
software I didn’t write and don’t
own.”
Actionability Exists on Two Dimensions
The alert must be actionable, and differentiated
from something that’s merely informational.
Alerts must route to someone who has the access,
permission, and skills to adequately perform said action.
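The two dimensions can be expressed as a small routing rule: first decide whether an alert is actionable at all, then send it to a team that can actually act on it. The sketch below is illustrative only; the service names, team names, and destinations are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Alert:
    service: str
    summary: str
    actionable: bool  # dimension 1: does this require a human to act?


# Hypothetical ownership map: which on-call team has the access,
# permission, and skills for each service (dimension 2).
OWNERS: Dict[str, str] = {
    "checkout-api": "payments-oncall",
    "edge-proxy": "infra-oncall",
}


def route(alert: Alert, owners: Dict[str, str] = OWNERS) -> str:
    """Return a destination for the alert based on both dimensions."""
    if not alert.actionable:
        return "dashboard"  # informational: record it, page nobody
    team = owners.get(alert.service)
    if team is None:
        # No known owner: surface the gap rather than page someone who cannot act.
        return "triage-queue"
    return f"page:{team}"


if __name__ == "__main__":
    print(route(Alert("checkout-api", "error rate above baseline", actionable=True)))
    print(route(Alert("checkout-api", "nightly job finished", actionable=False)))
    print(route(Alert("legacy-batch", "job failed", actionable=True)))
```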