Mistakes and failures are inevitable. Instead of fearing them, we should treat them as lessons that expose weak points in our organizations and systems. One way to do this is to write blameless postmortems. Daniel details how postmortems help organizations and teams focus on improvement, and how that boosts morale, improves products, and strengthens relationships with customers.
10. “A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.”
20. Blame, Sanctions And Accountability
BAD! Issue (root cause): “A backup script was run against the production database. It locked all tables and caused downtime.”
BETTER! Issue (root cause): “I (Daniel) ran a backup script against the production database. It locked all tables and caused downtime. Why is this script not configured to use --single-transaction for InnoDB tables?”
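The "better" wording turns the incident into a question about the tooling, and that question can become a concrete action item. The following is a minimal sketch, assuming the backup script wraps mysqldump; the helper name build_backup_command and the database name are hypothetical and do not come from the talk. Only the --single-transaction flag is taken from the slide.

```python
from typing import List


def build_backup_command(database: str, single_transaction: bool = True) -> List[str]:
    """Build a mysqldump argument list for a logical backup (hypothetical helper)."""
    command = ["mysqldump"]
    if single_transaction:
        # --single-transaction dumps transactional (InnoDB) tables inside one
        # transaction, giving a consistent snapshot without table locks.
        command.append("--single-transaction")
    command.append(database)
    return command


if __name__ == "__main__":
    # "shop_production" is a made-up database name for illustration only.
    print(" ".join(build_backup_command("shop_production")))
```

Making the safe flag the default means the next person who runs the script does not need to know about the locking behaviour in advance.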
33. Table Timeline Example
03:45 AM  Monitoring system detected a high rate of 5xx errors
03:46 AM  Monitoring system paged the on-call engineer
03:47 AM  Incident was confirmed
03:53 AM  Graphs were checked and a tenfold increase in traffic to Redis was observed
04:25 AM  Issue was escalated to a senior engineer
04:52 AM  WordPress plugin was downgraded to fix the issue
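A timeline like this one also yields simple numbers, such as how long it took to confirm, escalate, and fix the incident. The following is a rough sketch that computes minutes elapsed since detection from the example entries. It assumes all entries fall on the same day (the slide gives no date), and the function name minutes_since_detection is made up for illustration.

```python
from datetime import datetime

# Entries copied from the example timeline; only clock times are available.
TIMELINE = [
    ("03:45 AM", "Monitoring system detected high rate of 5xx errors"),
    ("03:46 AM", "Monitoring system paged engineer on call"),
    ("03:47 AM", "Incident was confirmed"),
    ("03:53 AM", "10 times increase in traffic towards Redis was observed"),
    ("04:25 AM", "Issue was escalated to a senior engineer"),
    ("04:52 AM", "WordPress plugin was downgraded to fix the issue"),
]


def minutes_since_detection(timeline):
    """Return (minutes elapsed since the first entry, event) pairs."""
    parsed = [(datetime.strptime(stamp, "%I:%M %p"), event) for stamp, event in timeline]
    start = parsed[0][0]
    return [(int((moment - start).total_seconds() // 60), event) for moment, event in parsed]


if __name__ == "__main__":
    for elapsed, event in minutes_since_detection(TIMELINE):
        print(f"+{elapsed:>3} min  {event}")
```

For the example, the fix lands 67 minutes after detection, and 32 of those minutes pass between the Redis observation and the escalation, which is the kind of gap a postmortem review might want to discuss.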
37. Post-mortem template. The obvious stuff.
1. Describe the incident and the impact.
2. How was it solved?
3. Complete timeline of events.
4. Root Cause(s) Analysis.
5. Lessons learned.
6. Action Item List.
7. Post-Mortem Review and Approval.
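The seven sections of the template can be kept as data so that every postmortem starts from the same skeleton and incomplete drafts are easy to spot. The following is one possible sketch, not a tool described in the talk: the names SECTIONS, render_skeleton, and missing_sections are hypothetical, and the "TODO" placeholder is an arbitrary choice.

```python
# Section titles taken from the template; everything else is illustrative.
SECTIONS = [
    "Describe the incident and the impact",
    "How was it solved?",
    "Complete timeline of events",
    "Root Cause(s) Analysis",
    "Lessons learned",
    "Action Item List",
    "Post-Mortem Review and Approval",
]


def render_skeleton(title, sections=SECTIONS):
    """Return a plain-text postmortem skeleton with one numbered heading per section."""
    lines = [f"Postmortem: {title}", ""]
    for number, section in enumerate(sections, start=1):
        lines.append(f"{number}. {section}")
        lines.append("   TODO")
        lines.append("")
    return "\n".join(lines)


def missing_sections(filled, sections=SECTIONS):
    """List the template sections that have no non-empty text in `filled`."""
    return [section for section in sections if not filled.get(section, "").strip()]


if __name__ == "__main__":
    print(render_skeleton("5xx errors after plugin upgrade"))
    draft = {"Lessons learned": "Pin plugin versions before upgrades."}
    print("Still missing:", missing_sections(draft))
```

A check like missing_sections could run before the review and approval step, so reviewers only see drafts in which every section has been filled in.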
47. Post-mortem template. The hidden gems.
1. Different Triggers/Contributors.
2. Mitigators.
3. Additions to the Timeline of Events.
4. Islands of Knowledge.
5. Open discussions.