29. Risk Homeostasis
Black Swan
Unknown unknowns
Why Failure Happens
Source: http://www.apoliticus.com/wp-content/uploads/2009/01/6_21_080306_rumsfeld.jpg
Sunday, June 20, 2010
30. Risk Homeostasis
Black Swan
Unknown unknowns
Change
Why Failure Happens
Source: http://bozark.net/wordpress/wp-content/uploads/2008/09/barack_obama_change_fairey.jpg
31. Risk Homeostasis
Black Swan
Unknown unknowns
Change
Many small failures
Why Failure Happens
Source: http://www.biojobblog.com/uploads/image/dominos.jpg
32. Risk Homeostasis
Black Swan
Unknown unknowns
Change
Many small failures
Humans
Why Failure Happens
Source: http://www.librarian.net/talks/clc/CLC.key/SJ_Shoulder_Shrug.jpg
36. Not unusual
Not expected
Polisher blocked
Moisture leaks into air system
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
37. Not unusual
Polisher blocked
Moisture leaks into air system
Flow of cold water stopped
Not expected
Not good
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
38. Not unusual
Polisher blocked
Moisture leaks into air system
Flow of cold water stopped
Not expected
Backup disabled
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
39. Not unusual
Polisher blocked
Moisture leaks into air system
Flow of cold water stopped
Not expected
Backup disabled
Indicator blocked
Doh!
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
40. Not unusual
Polisher blocked
Moisture leaks into air system
Flow of cold water stopped
Not expected
Backup disabled
Indicator blocked
Relief valve broken
Doh!
Dammit
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
41. Not unusual
Polisher blocked
Moisture leaks into air system
Flow of cold water stopped
Not expected
Backup disabled
Indicator blocked
Relief valve broken
Gauge broken
Doh!
Dammit
WTF
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
42. Not unusual
Polisher blocked
Moisture leaks into air system
Flow of cold water stopped
Meltdown
Not expected
Backup disabled
Indicator blocked
Relief valve broken
Gauge broken
Doh!
Dammit
Source: http://www.gladwell.com/1996/1996_01_22_a_blowup.htm
45. “accidental power failure”
Source: http://www.datacenterknowledge.com/archives/2010/06/16/power-failure-kos-intuit-sites-for-24-hours/
46. “traffic accident damaged a nearby
utility transformer”
Source: http://www.datacenterknowledge.com/archives/2007/11/13/truck-crash-knocks-rackspace-offline/
47. “unfortunate code change”
Source: http://www.datacenterknowledge.com/archives/2010/06/11/errant-code-change-crashes-10-million-blogs/
49. “Unhappy customers may get some
attention, but unhappy networked
customers can quickly impact your
business”
-- Clay Shirky
Source: http://happenupon.files.wordpress.com/2009/02/technology-guru-clay-shir-001.jpg, http://scholarlykitchen.sspnet.org/2010/03/02/shirky-at-nfais-how-abundance-breaks-everything/
63. Your site will fail
+
Downtime is bad
64. Your site will fail
+
Downtime is bad
+
Everyone will find out
65. Your site will fail
+
Downtime is bad
+
Everyone will find out
=
Screw it, I’ll become a
lumberjack
Source: http://sbadrinath.files.wordpress.com/2009/03/different26rqcu3.jpg
66. “Embrace fear of outages and
degradation. Use it to guide your
architecture, your code, your
infrastructure. So lean into it.”
-- John Allspaw, VP Tech. Ops at Etsy
77. “The larger issue here isn't just that a portion of
Facebook's platform has gone down - numerous web
services have issues from time to time, including
everything from Gmail to Twitter. An outage of this
length, however, with no official communication
from the company itself is disturbing.”
-- N.Y. Times
121. Prepare | Communicate | Explain
1. Communication channel
Something is wrong
Can’t tell if it’s me or you
I’ll assume it’s you
You suck
122. Prepare | Communicate | Explain
1. Communication channel
Something is wrong
Can’t tell if it’s me or you
I’ll assume it’s you
I know it’s you
Tell me when you’re back
You suck a lot less
134. 7 keys for public health dashboards
1. Must show current status for each “service”
2. Data must be accurate and timely
3. Must be easy to find
4. Must provide details for events in real time
5. Provide historical uptime and performance data
6. Provide a way to be notified of status changes
7. Provide details on how the data is gathered
Source: http://www.transparentuptime.com/2008/11/rules-for-successful-public-health.html
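As a rough illustration of keys 1 and 5 above, here is a minimal Python sketch of a per-service status record. The `Service` and `Event` classes and their fields are hypothetical names invented for this example; they are not taken from the talk or from any real dashboard product. The sketch also assumes that "down" events for a service never overlap.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Event:
    """One status change for a service (hypothetical schema)."""
    started: datetime
    ended: datetime
    status: str   # assumed values: "down" or "degraded"
    details: str  # key 4: human-readable details for the event


@dataclass
class Service:
    name: str
    events: List[Event] = field(default_factory=list)

    def current_status(self, now: datetime) -> str:
        """Key 1: report the status in effect at `now`."""
        for event in self.events:
            if event.started <= now < event.ended:
                return event.status
        return "up"

    def uptime_percent(self, window_start: datetime, window_end: datetime) -> float:
        """Key 5: historical uptime over a window.

        Only "down" events count against uptime, and they are
        assumed not to overlap each other.
        """
        window = (window_end - window_start).total_seconds()
        down = 0.0
        for event in self.events:
            if event.status != "down":
                continue
            overlap_start = max(event.started, window_start)
            overlap_end = min(event.ended, window_end)
            if overlap_end > overlap_start:
                down += (overlap_end - overlap_start).total_seconds()
        return 100.0 * (window - down) / window
```

A real dashboard would also need to be hosted off-site and fed by independent monitoring; this sketch covers only the data a status page might display.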
153. Prepare | Communicate | Explain
1. Postmortem
Admit failure
Sound like a human
Start time and end time
Source: https://groups.google.com/group/google-appengine/browse_thread/thread/a7640a2743922dcf
154. Prepare | Communicate | Explain
1. Postmortem
Admit failure
Sound like a human
Start time and end time
Who/what was impacted
Source: http://techcrunch.com/2009/11/02/large-scale-downtime-at-rackspace-cloud/
155. Prepare | Communicate | Explain
1. Postmortem
Admit failure
Sound like a human
Start time and end time
Who/what was impacted
What went wrong
Source: http://www.zendesk.com/2010/03/tuesday-double-whammy.html
156. Prepare | Communicate | Explain
1. Postmortem
Admit failure
Sound like a human
Start time and end time
Who/what was impacted
What went wrong
Lessons learned
Source: http://graysky.org/2010/02/downtime-postmortem/
158. Prepare | Communicate | Explain
“I was completely overwhelmed by
the amount of positive feedback and
support I received.”
159. Prepare | Communicate | Explain
1. Postmortem
Admit failure
Sound like a human
Start time and end time
Who/what was impacted
What went wrong
Lessons learned
2. Improve for the future
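The postmortem checklist above could be captured as a small template so that no section gets skipped. This is only a sketch: the `Postmortem` class, its field names, and the opening sentence in `render` are assumptions made for illustration, not a format prescribed by the talk. "Admit failure" and "Sound like a human" are about tone, so here they show up only as a plain first sentence.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Postmortem:
    """Hypothetical container for the checklist items on this slide."""
    start_time: datetime
    end_time: datetime
    impacted: str
    what_went_wrong: str
    lessons_learned: List[str]
    improvements: List[str]

    def missing_sections(self) -> List[str]:
        """Return the names of text or list sections left empty."""
        sections = {
            "impacted": self.impacted,
            "what_went_wrong": self.what_went_wrong,
            "lessons_learned": self.lessons_learned,
            "improvements": self.improvements,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Render a plain-text postmortem that opens by admitting failure."""
        minutes = int((self.end_time - self.start_time).total_seconds() // 60)
        lines = [
            "We had an outage, and we are sorry.",
            f"Start: {self.start_time:%Y-%m-%d %H:%M} UTC",
            f"End: {self.end_time:%Y-%m-%d %H:%M} UTC ({minutes} minutes)",
            f"Who/what was impacted: {self.impacted}",
            f"What went wrong: {self.what_went_wrong}",
            "Lessons learned:",
        ]
        lines += [f"- {lesson}" for lesson in self.lessons_learned]
        lines.append("Improvements for the future:")
        lines += [f"- {item}" for item in self.improvements]
        return "\n".join(lines)
```

Calling `missing_sections()` before publishing is one simple way to check that every item on the checklist has been addressed.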
160. “Google is not just saying sorry, they are
actually implementing serious changes which
probably represents millions of dollars of
development to help make sure this doesn't
happen again.”
Prepare | Communicate | Explain
Source: http://news.ycombinator.com/item?id=1168493
168. Upside of Downtime Framework 1.0
Prepare
1. Communication channel
- Easy to find
- Off-site
- Real-time
2. Process
- Give authority
- M.T.T.C.
- On-call/escalations
Communicate
1. Communicate
- Use channel
- M.T.T.C.
- Who/what affected
- When started
- ETA to resolution
- Update regularly
2. Fix it!
Explain
1. Post-mortem
- Admit failure
- Sound like a human
- Start time and end time
- Who/what was impacted
- What went wrong
- Lessons learned
2. Learn and improve
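The "Communicate" items in the framework (who/what affected, when started, ETA to resolution, update regularly) can be sketched as a status-update message builder. The function name `format_status_update`, its parameters, and the 30-minute default update interval are all assumptions for illustration; the talk does not prescribe a message format or an interval.

```python
from datetime import datetime, timedelta
from typing import Optional


def format_status_update(
    affected: str,
    started: datetime,
    eta: Optional[datetime],
    now: datetime,
    update_interval: timedelta = timedelta(minutes=30),  # assumed default
) -> str:
    """Build a short outage update covering the framework's Communicate items."""
    # Say so plainly when there is no ETA yet, rather than guessing one.
    eta_text = f"{eta:%H:%M} UTC" if eta else "unknown, we are investigating"
    # Commit to the time of the next update so readers know when to check back.
    next_update = now + update_interval
    return (
        f"Affected: {affected}\n"
        f"Started: {started:%H:%M} UTC\n"
        f"ETA to resolution: {eta_text}\n"
        f"Next update by: {next_update:%H:%M} UTC"
    )
```

Stating the time of the next update is one way to make "update regularly" a concrete promise.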
169. Upside of Downtime Framework 1.0
Prepare
1. Communication channel
- Easy to find
- Off-site
- Real-time
2. Process
- Give authority
- M.T.T.C.
- On-call/escalations
Communicate
1. Communicate
- Use channel
- M.T.T.C.
- Who/what affected
- When started
- ETA to resolution
- Update regularly
2. Fix it!
Explain
1. Post-mortem
- Admit failure
- Sound like a human
- Start time and end time
- Who/what was impacted
- What went wrong
- Lessons learned
2. Learn and improve
Be Prepared + Be Transparent + Be Human
170. Upside of Downtime Framework 1.0
Prepare
1. Communication channel
- Easy to find
- Off-site
- Real-time
2. Process
- Give authority
- M.T.T.C.
- On-call/escalations
Communicate
1. Communicate
- Use channel
- M.T.T.C.
- Who/what affected
- When started
- ETA to resolution
- Update regularly
2. Fix it!
Explain
1. Post-mortem
- Admit failure
- Sound like a human
- Start time and end time
- Who/what was impacted
- What went wrong
- Lessons learned
2. Learn and improve
Be Prepared + Be Transparent + Be Human
= Trust
179. Benefits
Gain trust
Reduce churn, increase loyalty
Reduce support costs
Ability to control the message
Competitive advantage
More time to focus on the actual problem
Reduce stress
183. Keys to Adoption
Getting past a culture of “hide the problem”
Overriding commitment to want to improve
184. Keys to Adoption
Getting past a culture of “hide the problem”
Overriding commitment to want to improve
Available resources to improve
185. Keys to Adoption
Getting past a culture of “hide the problem”
Overriding commitment to want to improve
Available resources to improve
Pain
186. Keys to Adoption
Getting past a culture of “hide the problem”
Overriding commitment to want to improve
Available resources to improve
Pain
Buy-in
191. Product
Management
Support
Reality: Proactiveness => Forgiveness
Default: Too much work
Reality: More upfront, less when it matters
Sales/
Marketing
Default: Let’s wait for complaints
Engineering/
Operations
192. Product
Management
Support
Reality: Proactiveness => Forgiveness
Default: Too much work
Reality: More upfront, less when it matters
Default: Don’t want to look bad
Sales/
Marketing
Default: Let’s wait for complaints
Engineering/
Operations
194. Product
Management
Support
Reality: Proactiveness => Forgiveness
Default: Too much work
Reality: More upfront, less when it matters
Default: Don’t want to look bad
Reality: Opportunity to learn/improve
Default: I don’t want my customers to know
Sales/
Marketing
Default: Let’s wait for complaints
Engineering/
Operations
195. Product
Management
Support
Reality: Proactiveness => Forgiveness
Default: Too much work
Reality: More upfront, less when it matters
Default: Don’t want to look bad
Reality: Opportunity to learn/improve
Default: I don’t want my customers to know
Reality: They’ll find out, better from us
Sales/
Marketing
Default: Let’s wait for complaints
Engineering/
Operations
200. “The measure of a society is how
well it transforms pain and suffering
into something worthwhile.”
-- Friedrich Nietzsche
201. “The measure of a company is how
well it transforms pain of downtime
into something worthwhile.”
-- Lenny Rachitsky
Source: Original quote inspired by Friedrich Nietzsche
210. "Unlikely that an accidental surface or subsurface
oil spill would occur from the proposed activities"
-- Exploration and environmental impact plan
Source: http://en.wikipedia.org/wiki/Deepwater_Horizon_drilling_rig_explosion
219. “Be not afraid of transparency;
some are born transparent,
some achieve transparency,
and others have transparency
thrust upon them.”
-- Borrowed from William Shakespeare
221. Making change
1. Find the bright spots - (this presentation has a bunch)
222. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
223. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
224. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
4. Find the feeling - (how would you feel?)
225. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
4. Find the feeling - (how would you feel?)
5. Shrink the change - (start small)
226. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
4. Find the feeling - (how would you feel?)
5. Shrink the change - (start small)
6. Grow your people - (everyone is learning as they go)
227. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
4. Find the feeling - (how would you feel?)
5. Shrink the change - (start small)
6. Grow your people - (everyone is learning as they go)
7. Tweak the environment - (create a simple process)
228. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
4. Find the feeling - (how would you feel?)
5. Shrink the change - (start small)
6. Grow your people - (everyone is learning as they go)
7. Tweak the environment - (create a simple process)
8. Build habits - (build process organically)
229. Making change
1. Find the bright spots - (this presentation has a bunch)
2. Script the critical moves - (framework)
3. Point to the destination - (W.W.G.D.)
4. Find the feeling - (how would you feel?)
5. Shrink the change - (start small)
6. Grow your people - (everyone is learning as they go)
7. Tweak the environment - (create a simple process)
8. Build habits - (build process organically)
9. Rally the herd - (get buy-in, the rest will follow)