SRVision 2019, Utrecht: Swarming and Cynefin

Swarming and Cynefin
SRVision 2019, Utrecht
Jon Hall
Principal Product Manager, Digital Service Management
© Copyright 2019 BMC Software, Inc.
@jonhall_

Recap: Tiered support
[Diagram: Level 1 Support, Level 2 Support (three teams) and Level 3 Specialists (four teams), with escalation between each tier]

Swarming defined
Swarming involves removing the tiers of support and calling on the collective expertise of a “swarm” of analysts.
https://www.serviceinnovation.org/intelligent-swarming/

Swarming example: BMC’s Sev-1 and Dispatch Swarms
[Diagram elements: Prioritise; Severity 1 Swarm; Local Dispatch Swarm; Local Product Line Support Teams]

Severity 1 Swarm
• Rapid responders
• Three agents on a scheduled one-week rotation
• Primary focus: Provide an immediate response, and resolve as soon as possible
• Swarm lead: Communications
• Other members: Research, coordinate, test

Dispatch Swarm
• “Cherry pickers”
• Meet every 60-90 minutes
• Primary focus: Can new tickets be resolved immediately?
• Also: Validation of ticket details before assignment to specialists
• Members: Experienced analyst, less-experienced analyst
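The Dispatch Swarm's two questions can be sketched as a small triage function. This is only an illustration in Python: the Ticket fields and the three outcome strings are hypothetical names made up here, not BMC's actual process or tooling.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    known_fix: bool         # hypothetical flag: the swarm can resolve it on the spot
    details_complete: bool  # hypothetical flag: ticket details have been validated


def dispatch(ticket: Ticket) -> str:
    """Return the swarm's decision for one new ticket."""
    # Primary focus: can the ticket be resolved immediately?
    if ticket.known_fix:
        return "resolve now"
    # Otherwise validate the details before anything reaches a specialist.
    if not ticket.details_complete:
        return "request details"
    return "assign to specialist"
```

For example, dispatch(Ticket("T-1", known_fix=False, details_complete=True)) returns "assign to specialist", so only validated tickets that the swarm cannot close itself reach the product line teams.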

Swarming example: BMC’s “Backlog Swarms”
[Diagram elements: Local Product Line Support Teams; Backlog Swarms]

Backlog Swarms
• Global fixers of troublesome tickets
• Meet regularly (often several times a day)
• Primary focus: Challenging 3rd-line tickets
• Replace reassignments and individual assignments
• Members: Experienced analysts, R&D engineers

Swarming Example: Drop-in SME support for Service Desk
• Regional chat-based service desk at a global Telco
• Agents can put a customer on hold for 3 minutes
• Subject experts wait in “always-on” chat channels
[Diagram: Service Desk Agents in customer chat sessions, connected to Subject Experts through three chat channels]

Swarming Example: Auto manufacturer’s connected cars team
Challenge: Scaling from small beginnings to millions of vehicles
• First responder initiates and coordinates swarms for big issues
• Other teams have 1 person on rotation for swarming
• Swarms may also involve 3rd parties (e.g. Amazon, Microsoft)
• Swarm grows and shrinks as needed
[Diagram elements: First Responder; Engineering Team A; Engineering Team B; 3rd Party Suppliers]

Swarming Example: “Always-on” Swarming
Scenario: Government agency with a growing DevOps initiative
Before transformation…
• Traditional tiered teams for Operations and Support
• Common pool of developers, assigned and reassigned to tasks across multiple projects
[Diagram elements: Application1; Application2; Developers; Support Specialists; Operations Team]

Swarming Example: “Always-on” Swarming
Scenario: Government agency with a growing DevOps initiative
After transformation…
• Product, not project thinking
• Team leaders have autonomy to create and change teams
• Support professionals embedded in full-stack teams
[Diagram elements: Application1; Application2; Developer; Operator; Support Specialist]

How to annoy a DevOps practitioner
• Work-in-progress queues
• Asynchronous communication
• Single role teams
• Individual over-exposure
• Lack of knowledge sharing

DevOps is mainstream. Sample speakers from DevOps Enterprise Summit 2018:
[Speaker logos not reproduced]

ITSM is under significant pressure from DevOps…
Deployment frequency: 46x higher
Change lead time: 2555x faster
Mean time to recover: 2604x faster
Change failure rate: 7x lower
Source: 2018 State of DevOps Report

2018 State of DevOps Report
But… Service Management has a lot to offer to DevOps

DevOps challenges Service Desk orthodoxies…
• New services and applications suddenly appear
• Lost visibility when issues go to developers
• Lack of knowledge sharing
• New kinds of customer, especially external

…but enterprise realities challenge DevOps
• Scaling customer support
• Understanding the context of an issue
• Adaptation to life “on call”
• What to prioritise? Fix bugs or build new stuff?
• How to process alerts, particularly if noisy/low-quality

DevOps teams aren’t as ITSM-phobic as some think
“I need to understand drifts, timelines…”
“The person who is on call at 4am needs to know who has been doing what”
“Context is a trigger word for me... in a company of 4000 people, things can get out of hand really fast if you don’t have context”
“What is actually running on an environment?”
“If you’re dropped in the middle of something, how did you get here?”
(Real quotes from conversations at Configuration Management Camp, Ghent)

“The enterprise space doesn’t move slowly because they’re stupid, or they hate technology. It’s because they have users.”
Luke Kanies, Puppet Founder, Configuration Management Camp 2015, Belgium.

Swarming aligns really well to DevOps
• Autonomy and self-organisation
• Knowledge transfer and skills development
• ChatOps, not email
• Prevention of accumulation of queued work
• Protection of individuals from burnout

We face an issue:
The tiered support system constrains ITSM’s
ability to adapt to new practices and thinking.

Cynefin: An example of new thinking
• Pronounced “kuh-nev-in”
• Developed by Dave Snowden while at IBM in 1999
• “A decision support framework which comes from a mixture of complexity theory and cognitive science… the opposite of a one-size fits all model”

Cynefin “Domains” – an overview
• Obvious and Complicated domains:
  • Repeating relationship between cause and effect
  • With Complicated you need to do analysis to find that relationship
• Complex domain:
  • Understanding the problem requires experimentation and analysis
  • May, over time, be able to move to Complicated
• Chaotic domain:
  • Dramatic and unconstrained
  • Focus on damage limitation, try to move to another domain
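The overview above can be sketched as a lookup from domain to decision sequence, in Python. The sequences are the ones Cynefin attaches to each domain ("Act, Sense, Respond" for Chaotic is taken from the framework itself, since the slides only describe it as "act immediately, observe impact"). The classify function and its three yes/no questions are a hypothetical simplification for illustration; real sense-making is a judgement call, not a lookup.

```python
from enum import Enum


class Domain(Enum):
    OBVIOUS = "Obvious"
    COMPLICATED = "Complicated"
    COMPLEX = "Complex"
    CHAOTIC = "Chaotic"


# Decision sequence per domain, as named in the Cynefin framework.
DECISION_SEQUENCE = {
    Domain.OBVIOUS: ("Sense", "Categorise", "Respond"),
    Domain.COMPLICATED: ("Sense", "Analyse", "Respond"),
    Domain.COMPLEX: ("Probe", "Sense", "Respond"),
    Domain.CHAOTIC: ("Act", "Sense", "Respond"),
}


def classify(cause_effect_repeats: bool, needs_analysis: bool, unconstrained: bool) -> Domain:
    """Hypothetical three-question classifier based on the bullets above."""
    if unconstrained:
        # Dramatic and unconstrained: limit damage first.
        return Domain.CHAOTIC
    if not cause_effect_repeats:
        # No repeating cause and effect: experimentation is needed.
        return Domain.COMPLEX
    # Repeating cause and effect: is analysis needed to find it?
    return Domain.COMPLICATED if needs_analysis else Domain.OBVIOUS
```

For instance, DECISION_SEQUENCE[classify(True, True, False)] gives ("Sense", "Analyse", "Respond"), the Complicated sequence.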

“Obvious” Domain
• “Sense, Categorise, Respond”
• Can apply best practice
• Template/knowledge-driven resolution
• Self service

“Complicated” Domain
• “Sense, Analyse, Respond”
• Good practice
• Dispatch-type swarm – pair up agents with varied experience
• Capture detailed knowledge for organisational learning
• Suits a “Dispatch Swarm” type approach?
[Diagram: Swarm Lead and Swarm Assistant]

Swarming in response to a Chaotic situation
• Not acting is not an option: act immediately, observe impact
• Try to move from Chaotic to Complex by introducing constraints
• Chaos may be an opportunity to innovate
[Diagram elements: Response Lead; Customer Liaison; Damage limitation/restoration; Innovation; Planned Response]

Enterprise systems are complex

The impact of Complexity
“Distributed systems have an infinite list of almost impossible failure scenarios”
Charity Majors, Observability for emerging infra
Config Management Camp, Ghent 2018

Some Complexity theory…
• Complex systems contain mixtures of latent failures
• It’s impossible not to have multiple flaws
• The failures change constantly
• Complex systems run as broken
• Operating complex systems needs human expertise
• Issues have multiple causes, not a single root-cause
“How Complex Systems Fail” (1998) - Richard I. Cook, MD
Cognitive Technologies Laboratory, University of Chicago

Complex systems fail in complex ways
“All twenty app services have 10% of nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. It clears up before we can debug it, every time”
“We run a platform, and it’s hard to distinguish between problems that users are inflicting on themselves, and problems in our own code, since they all manifest as the same errors or timeouts”
“I have 20 microservices and three datastores across three regions, and everything seems to be getting a little slower over the past 2 weeks… but nothing has changed that we know of. Latency is usually back to the historical norm on Tuesdays”
Who would you assign these to?
Charity Majors, Observability for emerging infra
Config Management Camp, Ghent 2018

Cynefin approach to a Complex issue
• “Probe, Sense, Respond”
• Identify multiple hypotheses
• Gain understanding of the system by interacting with it
• Create predictability, increase constraints, try to move to Complicated
[Diagram: Identify “coherent” hypotheses → Convene “safe to fail” experiments → Observe and monitor impact → Amplify good patterns, dampen bad]
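The "amplify good patterns, dampen bad" step can be sketched in Python as below. This is only an illustration under stated assumptions: the Hypothesis class, its numeric weight and the step size are hypothetical devices for showing where a swarm might focus next, not part of Cynefin or of any tool mentioned in this talk.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    weight: float = 1.0  # hypothetical measure of how much attention it gets


def amplify_or_dampen(hypotheses, observed_effect, step=0.5):
    """Re-rank hypotheses after a round of safe-to-fail experiments.

    observed_effect maps a hypothesis name to a signed number: positive
    means the experiment improved the system, negative means it made
    things worse, missing or zero means no observable impact.
    """
    for h in hypotheses:
        effect = observed_effect.get(h.name, 0.0)
        if effect > 0:
            h.weight += step                      # amplify a good pattern
        elif effect < 0:
            h.weight = max(0.0, h.weight - step)  # dampen a bad one
    # Strongest pattern first, so the swarm knows where to look next.
    return sorted(hypotheses, key=lambda h: h.weight, reverse=True)
```

Running several experiments in parallel and re-ranking after each round is one way to keep every probe small enough that a failed one does no lasting harm.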

Swarming in response to a complex issue
“Probe, Sense, Respond”
1. Initiate Analysis
• Detect complexity
• Clarify context
• Identify initial team
2. Establish Theories
• Gather information
• Form hypotheses
• Identify subgroups
3. Experiment and observe
• Parallel safe-to-fail experiments
• Observe and measure
• Amplify or dampen outcomes
4. Respond
• Assemble resolution team
• Release non-necessary people
• Resolve issue, document steps
[Diagram: swarm membership changes at each step, drawing on roles such as Swarm Lead, Assistant Lead, Network Specialist, Developer, Vendor Agent and Server Technician]
This could not work in a siloed, tiered structure!
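The way a swarm grows and shrinks across these four steps can be sketched in Python. The role names come from the slide's diagram, but which roles join or leave at which step is an assumption made for illustration, as are the Swarm class and its method names.

```python
class Swarm:
    """A swarm whose membership changes as the issue is worked."""

    def __init__(self, lead):
        self.lead = lead
        self.members = {lead}
        self.history = []  # (step name, sorted member list) snapshots

    def grow(self, *roles):
        self.members.update(roles)

    def shrink_to(self, *roles):
        # The lead always stays; everyone not named is released.
        self.members = {self.lead, *roles}

    def snapshot(self, step):
        self.history.append((step, sorted(self.members)))


def run_complex_swarm():
    swarm = Swarm("Swarm Lead")
    swarm.grow("Assistant Lead")                      # identify initial team
    swarm.snapshot("1. Initiate analysis")
    swarm.grow("Network Specialist", "Developer")     # subgroups per hypothesis
    swarm.snapshot("2. Establish theories")
    swarm.grow("Vendor Agent", "Server Technician")   # parallel experiments
    swarm.snapshot("3. Experiment and observe")
    swarm.shrink_to("Vendor Agent", "Developer")      # release non-necessary people
    swarm.snapshot("4. Respond")
    return swarm.history
```

The point of the sketch is that membership is a property of the issue at a given moment, not of a fixed tier: nobody is "assigned" the ticket, and people leave as soon as they are no longer needed.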

The way forward
• ITSM must adapt to retain relevance and credibility
• Over-constrained, inflexible practices will stifle this adaptation
• ITIL® v4 is a good step forward: giving more room to develop new
approaches to practices
• It’s a good time to be an innovative thinker

Swarming appearing in ITSM frameworks
ITIL® 4 Foundation (2019)
VeriSM – A service management
approach for the digital age (2017)

Some more information
Consortium for Service Innovation: Intelligent Swarming
serviceinnovation.org/intelligent-swarming
Boss Level Podcast: Dave Snowden on Cynefin (@snowded)
http://www.bosslevelpodcast.com/dave-snowden-on-complexity-theory-and-astrology/
Long-form blog on why Swarming works better for DevOps
http://medium.com/@jonhall_
Charity Majors at #cfgmgmtcamp: Observability for emerging infra (@mipsytipsy)
https://www.youtube.com/watch?v=fOdtgHu_KeA
(I’ve just tweeted these links)
