Many organisations already possess a vast amount of data about their production systems. As customer expectations evolve, organisations are challenged to find more proactive ways of dealing with traditionally reactive incident response activity. In this talk, we discuss approaches to unlock value from this data by making it truly actionable. The key takeaways: understand production failure modes better, enrich technical and business context effectively, decompose response activity into shared primitives, actions, and workflows, and share and augment this active knowledge repository on a continuous basis. Through case studies, we'll discuss how to accomplish this by engineering your observability processes and tooling for human-in-the-loop interpretation and response rather than a purely human-reliant strategy.
2. Hi there!
Squadcast: building a simple and free incident response tool to help increase adoption of Site Reliability Engineering (SRE)
Built real-time data science pipelines at two different startups in NYC and BLR
Can disease diagnosis and tracking be automated with ultrasound? Research at MIT
Studied Reliability & Production Engineering at IIT Kharagpur
What I'm definitely NOT: an expert in banking
Building reliable software at scale is really hard. (School of Hard Knocks)
3. Squadcast
System of engagement for managing reliability end-to-end, combining human + machine data
Democratize SRE!
Service Definitions
Service Level Objectives (SLOs)
Service Level Indicators (SLIs)
Error Budgets
And ACTIONS! (what we'll focus on for this talk)
4. What Service Level Objectives (SLOs) Look Like
SLOs, SLIs, Error Budgets, and SRE best practices like generic mitigation help limit toil and turn the vicious cycle into a virtuous one.
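As a concrete illustration of the SLO / SLI / Error Budget relationship (my own sketch, not Squadcast's implementation), an error budget falls straight out of the SLO target and an observed SLI:

```python
# Minimal sketch of SLO / SLI / error budget arithmetic.
# Assumes a simple availability SLI (good requests / total requests);
# real SLIs and burn-rate policies are richer than this.

def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail under the SLO."""
    return (1.0 - slo_target) * total_requests

def budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative = breached)."""
    budget = error_budget(slo_target, total)
    return (budget - failed) / budget

# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)  # 250 failures so far
```

Once the budget is being burned faster than it accrues, that is the signal to spend effort on reliability (or generic mitigation) instead of features.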
6. What does Observability really mean?
How well are you able to infer a system's internal state given its output? (Input → System → Output)
Excellent: Proactive Customer Success; High TBF (time between failures); Low TTA, TTR (time-to-act, time-to-resolve); Transparency; Predictable Change Velocity; Low Toil; Sticky Customers
Not-so-good: Reactive Customer Success; Low TBF; High TTA, TTR; Lack of Transparency; Unpredictable Change Velocity; High Toil; Meh Customers
8. What the data tells us at Squadcast!
Time-to-act (TTA) and Time-to-resolve (TTR) are on average larger and more variable outside the main working shift.*
Incident response globally could be more consistent, transferable, and scalable within organizations. Response patterns cannot be versioned or programmed against. Similar to CI/CD circa 2005.
Are we at peak observability as a community? No. If we can't act effectively, we cannot claim peak observability.
*Normalized across three 8-hour shifts across the world. Data is not representative of any individual customer.
9. A Deeper Look (SRE teams at 72 companies)
A majority of respondents considered themselves SREs (at well-known companies).
56% were managing between 50-500 services, and 32% were managing 10-50 services.
11. We may need a fourth pillar to optimize for peak observability, by building an active knowledge repository of Actions.
[Diagram: Pillars of Observability (Logs, Metrics, Traces, plus the proposed fourth, Actions), plotted by Data vs. Impact]
12. What are Squadcast Actions? (Quick Demo)
Actions
- Primitives
squadctl circleci:rebuild platform-js/master/latest
squadctl namespace:action :repo/branch/tag
- Runbooks, for the long tail of response activity
Markdown-supported active runbooks in a language of your choice
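To make the primitive pattern above concrete, here is a hypothetical sketch of how a `namespace:action repo/branch/tag` invocation might be decomposed before dispatch. This is illustrative only; the field names are my assumption, not Squadcast's actual parser or schema:

```python
# Hypothetical parser for the "namespace:action repo/branch/tag" pattern
# shown above (e.g. "circleci:rebuild platform-js/master/latest").
# Field names are my own assumption, not Squadcast's schema.

from typing import NamedTuple

class ActionCall(NamedTuple):
    namespace: str   # integration, e.g. "circleci"
    action: str      # verb within that integration, e.g. "rebuild"
    repo: str
    branch: str
    tag: str

def parse_action(command: str) -> ActionCall:
    verb, target = command.split()
    namespace, action = verb.split(":")
    repo, branch, tag = target.split("/")
    return ActionCall(namespace, action, repo, branch, tag)

call = parse_action("circleci:rebuild platform-js/master/latest")
```

Keeping the primitive's shape this regular is what lets responses be versioned and programmed against, per the earlier CI/CD analogy.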
13. Building Actions: a few things we learnt along the way
Don't Repeat Yourself (DRY)
Audit Trails with an immutable log
Continuous Security
Composing Action Primitives into Workflows
Continuous Feedback in the SDLC
Heterogeneous Workloads become easier to support
Hybrid Cloud
And many more...
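Two of the lessons above, composing primitives into workflows and keeping audit trails in an immutable log, can be sketched together. This is an illustrative design under my own assumptions, not Squadcast internals:

```python
# Illustrative sketch: composing action primitives into a workflow
# while recording an append-only audit trail. Not Squadcast's code.

from datetime import datetime, timezone
from typing import Callable, List, Tuple

AuditEntry = Tuple[str, str, str]  # (timestamp, action name, result)

class Workflow:
    def __init__(self) -> None:
        self._steps: List[Tuple[str, Callable[[], str]]] = []
        self._audit: List[AuditEntry] = []

    def add(self, name: str, primitive: Callable[[], str]) -> "Workflow":
        self._steps.append((name, primitive))
        return self  # chainable, so primitives compose fluently

    def run(self) -> List[AuditEntry]:
        for name, primitive in self._steps:
            ts = datetime.now(timezone.utc).isoformat()
            result = primitive()
            self._audit.append((ts, name, result))  # append-only trail
        return list(self._audit)  # copy, so callers can't mutate history

wf = (Workflow()
      .add("scale_up", lambda: "replicas=10")
      .add("notify", lambda: "paged on-call"))
trail = wf.run()
```

The append-only trail is what makes the response auditable after the fact, and the fluent composition is what turns primitives into the snippets and workflows mentioned later.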
14. Let's look at a real example
A Fortune 100 enterprise has over 100 TB of release artifacts, growing at a double-digit percentage every year. They have different engineering teams for each product line, a NOC that routes production incidents to the appropriate team, a SOC…
Can we unlock additional value by taking more actions during incident response, actions that improve observability and thereby change velocity?
Use Cases
➔ Automatically flagging build artifacts for telemetry spikes, and rolling back
➔ Flagging build artifacts for new vulnerabilities, with automated rollbacks
➔ Scaling the production environment based on external events such as traffic spikes
➔ And many more
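The first use case (flag a build artifact on a telemetry spike, then roll back) reduces to a small decision rule. A naive threshold sketch follows; a production detector would compare against seasonal baselines and burn rates rather than a trailing mean:

```python
# Naive sketch of the "flag on telemetry spike, then roll back" use case.
# Flags when the latest error-rate sample exceeds k times the trailing
# mean; real detectors would use baselines and burn-rate windows.

from statistics import mean

def should_rollback(error_rates: list, k: float = 3.0) -> bool:
    """True if the newest sample is a k-fold spike over the history."""
    *history, latest = error_rates
    baseline = mean(history)
    return baseline > 0 and latest > k * baseline

calm  = [0.01, 0.012, 0.011, 0.013]   # steady error rate after a deploy
spike = [0.01, 0.012, 0.011, 0.09]    # last deploy misbehaving
```

Wiring a rule like this to a rollback primitive is what moves the failure mode from "react when paged" to "automated, human-auditable response".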
15. Release Promotion and the SRE Loop (for a simple workload)
[Diagram: from VCS, artifacts are promoted through quality gates across Dev, Staging, GA, and Production. An SRE loop feeds back from Production: SLO Breach → Incident Routing → Triage → Generic Remediation → Root Cause Analysis]
16. Release Promotion and the SRE Loop
[Same diagram as slide 15]
Motivation: improving observability can reduce the drag force on change velocity.
17. Drag Force Reduction at Scale, with Superior Traceability
- Backpropagate accurate, real-time metadata associated with releases to JFrog Artifactory (example used here) or Sonatype Nexus
- Use that metadata to programmatically drive incident response, using Artifactory Query Language in Squadcast Runbooks
Quick Demo
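As an illustration of the second point, an Artifactory Query Language (AQL) query can select artifacts by the metadata that was back-propagated. The sketch below only builds the query string; `items.find(...)` and the `"@key"` property syntax are standard AQL, while the property key (`deploy.sha`) and repo name are hypothetical examples:

```python
# Sketch: building an AQL query from incident context, for use in a
# runbook. The "deploy.sha" property key is a hypothetical example;
# items.find(...) and "@<property>" lookups are standard AQL.

import json

def aql_for_release(repo: str, sha: str) -> str:
    criteria = {
        "repo": repo,
        "@deploy.sha": sha,   # "@" prefixes a property lookup in AQL
    }
    return "items.find(" + json.dumps(criteria) + ")"

query = aql_for_release("libs-release-local", "a1b2c3d")
```

A runbook step can then feed the matched artifacts straight into a rollback or promotion action instead of a responder hunting for them by hand.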
18. How Squadcast Works
- Squadcast Actions and Runbooks trigger programmatic response during incident response
- Human-in-the-loop, machine-assisted
- Primitives can be composed: primitives to snippets to more complicated workflows
- Functional from all interfaces, including mobile, because incidents happen anytime, anywhere.
19. Understanding Failure Modes. Let's start the clock!
Known Knowns (e.g., that telemetry spike): Automate
Known Unknowns (e.g., external traffic spikes): Prepare, then human-in-loop
Unknown Knowns (e.g., vulnerabilities): Prepare, then human-in-loop
Unknown Unknowns: Convert to the other three types
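The matrix above can be read as a lookup from how well a failure mode is understood to a response strategy. A sketch of that mapping (my framing of the slide's quadrants, not a Squadcast API):

```python
# The failure-mode matrix above as a lookup table. Strategy strings
# mirror the slide; the tuple keys are my own framing of the quadrants.

RESPONSE_STRATEGY = {
    ("known", "known"): "automate",                        # e.g. that telemetry spike
    ("known", "unknown"): "prepare, then human-in-loop",   # e.g. external traffic spikes
    ("unknown", "known"): "prepare, then human-in-loop",   # e.g. vulnerabilities
    ("unknown", "unknown"): "convert to the other three types",
}

def strategy(awareness: str, understanding: str) -> str:
    return RESPONSE_STRATEGY[(awareness, understanding)]
```

The point of the clock metaphor: each incident is a chance to move a failure mode up this table, so that over time more of the response is automated or at least prepared.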
20. Responding to Failure Modes: what we'll take a look at in the demo. Let's start the clock!
(Same failure-mode matrix as slide 19: automate the Known Knowns; prepare, then human-in-loop for Known Unknowns and Unknown Knowns; convert Unknown Unknowns into the other three types.)
21. DEMO
1. Improving traceability by building a loop between release metadata / change requests and incident response
2. Enriching production context by annotating Actions more comprehensively in your visualization tool, such as Grafana
3. Try at home: improve and automate response to vulnerabilities on a real-time basis (you can start by automating response to vulnerabilities from Snyk)
(Failure-mode matrix repeated from slide 19.)
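For demo item 2, Grafana's annotations HTTP API (`POST /api/annotations`) accepts a JSON body with `time` (epoch milliseconds), `tags`, and `text`. The sketch below only builds that payload; the endpoint and field names come from Grafana's public API, while the tag scheme and action naming are my own:

```python
# Sketch for annotating an executed Action onto Grafana dashboards.
# {"time", "tags", "text"} sent to POST /api/annotations is Grafana's
# documented annotations API; actually sending the request is omitted.

import json
import time

def annotation_payload(action, detail, ts_ms=None):
    body = {
        "time": ts_ms if ts_ms is not None else int(time.time() * 1000),
        "tags": ["squadcast-action", action],   # tag scheme is my assumption
        "text": detail,
    }
    return json.dumps(body)

payload = annotation_payload("circleci:rebuild",
                             "rolled back build",
                             ts_ms=1700000000000)
```

With each executed Action annotated, a responder looking at a telemetry spike sees exactly which response steps landed on the same timeline.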
24. Here's one more idea...
Actions help make your system more Observable.
25. What does the modern enterprise gain from the fourth pillar of Observability?
Top 3 Priorities of the Modern Enterprise*
88% Revenue Acceleration
71% Improved Agility and faster Time to Market
47% Cost Reduction
29% Better Management of Regulatory and Compliance Risks
29% Increased CSAT
41% Other (Brand, Strategic, Financial)
*McKinsey Digital Survey of CIOs/CTOs at 52 enterprises. 78 percent work at orgs with 5,000+ employees, and 44 percent work at companies with annual revenues of $10 billion+.