DevOps Enterprise Summit 2015 presentation with Kevina Finn-Braun, Director of SRE Management at Salesforce: this is the story of my months-long journey with Kevina and her team to identify the specifics of what made reliability retrospectives difficult to have, why actionable takeaways were often lacking, and how the feedback loops within the company’s operations organization weren’t serving Salesforce’s needs.
We then ran a series of experiments together, putting the SRE team on a road to improving their ability to respond, react, remediate, and reincorporate learnings from failure into the organization.
The Blameless Cloud: Bringing Actionable Retrospectives to Salesforce
1. Kevina Finn-Braun, Salesforce
J. Paul Reed, Release Engineering Approaches
DevOps Enterprise Summit, 2015
The Blameless Cloud: Bringing Actionable Retrospectives to Salesforce
2. Kevina Finn-Braun
• Director of Site Reliability Service Management at Salesforce
• Business Continuity at Yahoo
• Geeks out on Group Dynamics and Behavior
• @kfinnbraun on Twitter
• Prepping for the zombie apocalypse
@kfinnbraun @jpaulreed#DOES15
3. J. Paul Reed
• @jpaulreed on Twitter
• Host of The Ship Show, @shipshowpodcast on Twitter
• Principal Consultant, Release Engineering Approaches
• Spend my days talking to organizations about “The DevOps™”
4. “Site Reliability” at Salesforce
• Primary operational team supporting availability
• Acceptance and validation activities
• Develop and implement operational improvements for SFDC
• “Game days”
5. Service Reliability Hurdles at SFDC
• Inconsistent application of process, leading to inconsistent information collection
• Incident handling/remediation crossing silo boundaries
• Confusion over service ownership, due to restructured responsibilities
• Disjointed, “heavyweight” meetings
• Postmortems centered around “The Old View” of human error
6. Language of the “Old View”
• “5 whys”
• “Root cause” analysis
• “Why didn’t you[r team]…”
• “You[r team] should have…”
• “Best practices”
9. The Timeline
• October 2014: First Meeting
• January 2015: “Blow up” HA Forum
• April 2015: Status Check, including assessment shared with senior leaders
• May 2015: Service ownership roles shift
• July 2015: Initial Workshop on “The New View”
• August 2015: Identified first group for coaching
• August 2015 to today: Continued focus and deep-dive on WSRR
• August 2015 to today: Weekly sessions with the initial group
10. Root Cause Analysis Workflow
• Designed & implemented two years ago
• Anchored the process around the weekly “HA Forum”
• Intended to apply to all incidents…
• In practice, focused on high profile incidents
[Figure: flowchart of the workflow. Only the labels survive; the arrows connecting them are not recoverable.]
• Start: Incident, Event, Bug
• Steps: Initial Analysis; Facilitator opens investigations and schedules post mortem meeting; Request RCA/Failure Analysis; Identify corrective actions and implementation plans; Assign actions to scrum teams; RCM Process; Unable to ascertain root cause; update record with “KE Status”; Engage scrum teams as required; HA Forum; Weekly meetings to follow up with scrum master on progress; Additional work items from HA are assigned; Update record and set status to “resolved”; Tier 3 support communicate RCM to customer(s)
• Decisions (Y/N): RC Known?; RC Identified?; RCM Needed?; Corrective Actions complete?; Review @HA?; Impact to Customer/Production or ability to release?
• End states: END (two exits)
• HA? Incident Guidelines:
  Severity 0,1: YES
  Severity 2: Maybe (instance & incident length?)
  Functional Regression: Maybe
  Incorrect/Incomplete Release: YES
  Deployment Delayed or Rolled Back: Maybe
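The “HA? Incident Guidelines” box in the workflow flowchart reads as a small decision rule. Below is a minimal sketch of it in Python, not Salesforce's actual tooling. The `Incident` record, its field names, and the `Review` values are hypothetical. The sketch also assumes that a YES criterion outranks a Maybe when several apply, which the slide does not state. Incidents that match none of the listed criteria are reported as unspecified rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Review(Enum):
    """Whether an incident goes to the weekly HA Forum."""
    YES = "yes"
    MAYBE = "maybe"              # slide: depends on instance & incident length
    UNSPECIFIED = "unspecified"  # the slide gives no guidance for this case


@dataclass
class Incident:
    # Hypothetical fields, named after the guideline entries on the slide.
    severity: Optional[int] = None
    functional_regression: bool = False
    incorrect_or_incomplete_release: bool = False
    deployment_delayed_or_rolled_back: bool = False


def ha_forum_review(incident: Incident) -> Review:
    """Apply the slide's guidelines; YES wins over Maybe (an assumption)."""
    # "Severity 0,1: YES" and "Incorrect/Incomplete Release: YES"
    if incident.severity in (0, 1) or incident.incorrect_or_incomplete_release:
        return Review.YES
    # "Severity 2", "Functional Regression", "Deployment Delayed or
    # Rolled Back" are all listed as Maybe.
    if (
        incident.severity == 2
        or incident.functional_regression
        or incident.deployment_delayed_or_rolled_back
    ):
        return Review.MAYBE
    return Review.UNSPECIFIED


if __name__ == "__main__":
    print(ha_forum_review(Incident(severity=1)).value)
    print(ha_forum_review(Incident(severity=2)).value)
```

Writing the rule down this way makes the “Maybe” cases explicit: they are where the slide leaves the decision to human judgment, which is one plausible source of the inconsistent application of process described earlier in the deck.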
12. Root Cause Analysis Workflow in Reality
[Figure: the root cause analysis workflow flowchart again, with an added annotation that survives only as the fragment “from incident resolution.”]
• Silo transition boundaries evident in the workflow
• Some had little/no contact, via the process, with other teams required to perform their job
• Sampling of incident reports uncovered consistent inconsistencies
• The “Bermuda Blob”
13. Getting a Feel for the Weather
15. Head First into the Storm
16. Language: Matters
• “HA Forum” ➡ “WSRR”
• “WAR” (What is it good for?)
• Postmortem versus Retrospective
• Problem Team versus Solution Team
• Root Cause versus Proximate Cause
17. Behavior: Matters
• Intra-team behavior
• Inter-team behavior
• This is not “#NAFB”
• “People in complex systems create safety. … The occasional human contribution to failure occurs because complex systems need an overwhelming human contribution for safety.” (Sidney Dekker)
18. Structure: Matters
19. Structure: Matters
20. “Blameless” “Postmortems”?
• Brené Brown, research sociologist, on vulnerability
• “Blame is a way to discharge pain and discomfort”
• Postmortem has a heavy connotation
• “Awesome postmortems?” Really?!
26. Language and Behaviors (Novice, Beginner, Competent, Proficient, Expert)
• Novice
  Language: “Incidents are bad; my job is on the line”; “I’m getting sent to the principal’s office because of this outage”
  Behaviors: Completes the post-incident “paperwork”; No formal retrospective/hallway retrospectives
• Beginner
  Language: “Let’s fix this as fast as possible”; “What’s the correct fix to avoid this specific issue in the future?”
  Behaviors: Some information (inconsistently) recorded; Jump to a focus on why
• Competent
  Language: “Let’s review the timeline/incident report to answer that”; “We need to find the root cause of this incident”
  Behaviors: Follows the prescribed format for retrospectives; Have and incorporate complete dataset for the incident into the retrospective
• Proficient
  Language: “Now that we’ve established what happened, how did it happen?”; “How did these multiple factors influence our complex system?”
  Behaviors: Identifies inherent bias in self and others; Perspectives solicited from all involved team members/functional groups
• Expert
  Language: “How does our team/system contribute to our successes?”; “What can we incorporate from this incident to better respond next time?”
  Behaviors: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias; Retrospective outcomes are fed back into the system and prioritized
27. Retrospectives Facilitate the Service (and Development!) Improvement Process
28. Being “Too Busy” to Learn or Improve Means You Are in a Downward Spiral, by Definition
29. It’s Not About the Outcome. It’s About the Response.
30. Why + How Is More Important Than What
31. You Are Never Done.
32. You. Are. Never. Done.
33. Our Forecast for the Future
• Evolving the concept of Service Ownership
• Salesforce-specific Retrospective Guides
• Global “live-site” coaching
• Refocus on getting the business what it wants
34. Avenues for Collaboration
• How does the described Dreyfus model apply in other organizations?
• Would love to hear stories from other enterprises about their retrospective process, who does them, and where they live within the organization
36. Photo Credits
• Slide 1: https://en.wikipedia.org/wiki/File:Golden_Fog,_San_Francisco.jpg
• Slide 4: Courtesy Kevina Finn-Braun/Salesforce
• Slide 6: https://www.flickr.com/photos/hannaneh/6464986121
• Slide 7: https://www.youtube.com/watch?v=_DEToXsgrPc#t=1h5m50s
• Slide 13: http://kathmajp.weebly.com/all-movie-reviews/movie-review-twister
• Slide 14: http://thevane.gawker.com/heres-everything-they-got-wrong-and-right-in-the-movi-1609968202
• Slide 15: https://www.flickr.com/photos/ravedelay/17761863929
37. Photo Credits
• Slide 16: Screenshot of aviationweather.gov
• Slide 17: https://www.flickr.com/photos/ravedelay/17534032771/
• Slide 18: https://www.youtube.com/watch?v=8veT5QspylE#t=15m30s
• Slide 19: https://www.flickr.com/photos/jkirkhart35/4984385396
• Slide 20: https://www.youtube.com/watch?v=iCvmsMzlF7o
• Slide 33: https://commons.wikimedia.org/wiki/File:Rainbow_background.jpg
• Slide 35: https://en.wikipedia.org/wiki/File:Clouds_spilling_over_San_Francisco.jpg